Algorithms for Efficient Mining of Statistically 
Significant Attribute Association Information 

Pritam Chanda, Aidong Zhang, and Murali Ramanathan

Abstract — Knowledge of the association information between the attributes in a data set provides insight into the underlying 
structure of the data and explains the relationships (independence, synergy, redundancy) between the attributes and class 
(if present). Complex models learnt computationally from the data are more interpretable to a human analyst when such 
interdependencies are known. In this paper, we focus on mining two types of association information among the attributes - 
correlation information and interaction information for both supervised (class attribute present) and unsupervised analysis (class 
attribute absent). Identifying the statistically significant attribute associations is a computationally challenging task - the number of 
possible associations increases exponentially and many associations contain redundant information when a number of correlated 
attributes are present. In this paper, we explore efficient data mining methods to discover non-redundant attribute sets that contain 
significant association information indicating the presence of informative patterns in the data. 

Index Terms — Information theory, Entropy, Attribute Association, Correlation, Interaction. 
1 Introduction 

Applications in many fields, including scientific research,
economics, finance and marketing, produce multi-dimensional
data sets in which complicated interdependencies exist between
the attributes of the data, such as independence, correlation,
synergy, and redundancy. Data mining and statistical techniques
have been employed to make sense of these data sets and to
discover useful patterns and models that help explain how the
system being represented works. To discover key patterns in the
data, it is necessary to find relationships or associations
between the attributes that help to explain the
interdependencies among them. Exploring attribute association
patterns enables deeper insight into the data, is useful for
understanding probabilistic models representing the data, and
possibly allows one to gain practical knowledge from the
model(s) computationally learnt using the data.

From an information theoretic perspective, association 
information between attributes can be broadly categorized 
into (1) correlation information and (2) interaction infor- 
mation. The correlation information of an attribute set 
represents the total amount of information shared among 
the attributes; equivalently, it can be viewed as a general 
measure of dependency. The interaction information of an
attribute set captures the multivariate dependencies between
the attributes that are not present in any subset of the given
set. The two measures are related and complement each other in
discovering useful patterns and relationships in the data.
2 Background and Significance 

In this paper, we study the problem of mining the above 
two types of association information that are statistically 
significant in discrete data for both supervised (i.e. when 
a class label attribute is present) and unsupervised analysis 
(no class label is present). Note that the two analysis
methods are different: in the unsupervised case we need
to find sets of attributes that have significant association
information with one another, while in the supervised case
we need to find attributes that have significant association
information for the class attribute. Finding these types of
associations has important implications in many fields of
study. For example, in a biological or genetic context, the 
risk of developing many common and complex diseases 
such as different forms of diabetes, mental illness, cancer, 
autoimmune and cardiovascular diseases involves complex 
interactions between multiple genes and several endogenous
and exogenous environmental factors. For many common
diseases, each gene (or each single nucleotide variation on
that gene) individually has only a weak statistical association
with the disease; together, however, the genes act in concert
(often with several non-genetic factors, e.g. gender, age,
smoking habits, drinking habits) to control the expression of
the disease [1], [2]. The successful detection of such genetic
associations can provide the scientific basis for many
underlying biological interactions, improve the prospects for
uncovering previously undiscovered genes involved in the
disease process, and help to develop preventative and curative
measures for particular genetic and non-genetic
susceptibilities. Beyond genetics, exploring association
information is also useful in supervised learning problems such
as feature selection, where the task is to find a subset of the
features that improves the accuracy of a classifier. A
statistical association between two attributes exists when the
joint effect of both in a model differs from that obtained by
additively combining the individual effects. Associations among
the attributes are especially important for understanding an
appropriate probabilistic model representing the data and for
subsequent feature selection. Discovering associations between
the attributes in a data set provides insight into the
underlying structure of the data and explains the relationships
(independence, synergy, redundancy) between the attributes.
Complex models learnt computationally from the data are more
interpretable to a human analyst when such interdependencies
are known.

3 Related Work 

Mining correlation information in high-dimensional discrete
data has attracted much research interest in recent years.
Various approaches have been developed, including correlation
pattern mining [3], [4], [5], feature selection [6], [7], [8],
finding correlated item pairs [9], and others. Mining
correlation information is also closely associated with mining
frequent patterns in the data, which has its roots in the
association rule mining problem introduced with the Apriori
algorithm [10]. Since then much work has been done on frequent
pattern mining with itemsets, constrained rule mining,
measuring the interestingness of mined association rules, and
so on. Traditionally, support, confidence and related measures
have been used to assess the usefulness of the mined rules.
Correlation pattern mining was given a statistical basis in
[11], where the authors used the χ² correlation measure between
pairs of attributes. Information theoretic metrics such as
entropy have also been used as a quality measure for sets of
attributes (or items), and efficient algorithms have been
proposed to mine the maximally informative k-itemsets [12].
Algorithms have also been proposed to find low-entropy sets
[13], where the authors introduced two kinds of low-entropy
trees and discussed their properties. In the NIFS method [14],
the authors explore the problem of finding non-redundant high
order correlations in binary data and propose pruning
strategies by investigating the bounds of multi-information,
which is a generalization of pair-wise mutual information.
Their pruning methods are based on hard thresholds, which are
difficult to set without trial and error. Here we derive bounds
on correlation information for both supervised and unsupervised
analysis and use these bounds in our pruning strategies;
however, instead of hard thresholds, we employ the
distributional properties of correlation information, which
improves the power of our method in the presence of noise in
the data. Our bounds are also based on entropy inequalities and
are therefore not restricted to binary data. Using experimental
data sets, we further show that our methods can identify
attribute sets (we call them special combinations of interest)
that are not detected in [14]. We also mine interaction
information among the attribute sets and use a novel fast
permutation strategy to evaluate the statistical significance
of the interaction information of attribute sets.

Compared with correlation information, interaction information
is a more parsimonious measure of association. Interaction
information between variables and attributes has been studied
in diverse areas such as physics, information theory,
neuroscience, game theory, law and economics. The concept was
first introduced by McGill [15] as a multivariate
generalization of Shannon's mutual information [16]. Later, Han
[17] gave rigorous formal definitions of the concepts of
interaction, while properties of positive and negative
interactions appeared in [18]. In physics, Cerf [19] analyzed
the interaction information of three variables in quantum
physics, while Matsuda [20] studied properties of interaction
information (referred to as higher order mutual information
functions) for general complex systems. Bell [21] defined
co-information, forming a partially ordered lattice in terms of
the entropies, and used it for dependent component analysis.
More recently, Jakulin [22], [23] studied it extensively from a
machine learning perspective and provided methods for
visualizing interactions and interpreting the structure in the
data.

Correlation measures such as Pearson's correlation,
Spearman's rank correlation, Kendall's tau correlation and
chi-square measures are common examples of first order
association measures used to evaluate individual attribute
dependencies (synergy with the class label) or the relevance of
an attribute in predicting the class label. Associations among
attributes have been used for feature selection directly or
indirectly in various data mining and machine learning
applications; however, most of these consider only first order
associations (mutual information) [24], [?], [?]. Mutual
information has also been used as a similarity measure for
clustering instances [25], [26]. In mining attribute
associations it is also important to consider the presence of
correlated attributes, as this results in several associations
that contain redundant information regarding the class label.
Feature selection methods that explore means to reduce
redundancy among the attributes have been studied by some
researchers [27], [?], [28]. For example, in [?] the authors
devise a minimal-redundancy-maximal-relevance (mRMR) criterion
using information theoretic methods to reduce redundancy and
select promising features. In CfsSubsetEvaluation [27], subsets
of features that are highly correlated with the class while
having low inter-correlation are preferred. Although methods
such as [29] (GRAD) and [30] directly or indirectly consider
higher order associations, they do not address the problem
posed by the presence of a large number of correlated variables
in the data. Mining highly correlated association patterns is
also explored in [31], [32]. An important difference between
our work and others is that we mine higher order association
information, consider redundancy among the attributes instead
of simple pairwise correlations between attributes as in [27],
and use statistical significance based pruning strategies
(unlike [31], [32], [?]) to improve the efficiency of our
search methods for both supervised and unsupervised analyses.

4 Contributions of the paper 

Mining the significant attribute associations in a high
dimensional data set is a computationally challenging task:
the number of possible associations increases exponentially
because all possible subsets of the attributes need to be
considered, and most of these associations contain redundant
information when a number of correlated attributes are
present. Although in practice the meaningful attribute
associations of interest are far fewer than all possible
associations, the high dimensionality of the data sets still
makes the number of relevant attribute associations very
large. Exploring all subsets of attributes for significant
association information becomes computationally intractable as
the number of attributes increases. In this paper, we study the
problem of mining statistically significant correlation
information and interaction information in discrete data in
both unsupervised and supervised contexts. Our work is based on
the concepts developed in [33], where we developed the
algorithms for unsupervised analysis only. In this paper, we do
the following:



1) We present the information theoretic metrics repre- 
senting interaction information (termed K-way inter- 
action information or KWII), correlation information 
for unsupervised analysis (termed total correlation 
information or TCI) and correlation information for 
supervised analysis (termed class associated correla- 
tion information or CACI). 

2) We demonstrate and prove the relationships between 
the above association information metrics. 

3) We derive the distributional properties of TCI and 
CACI for evaluating the statistical significance of
correlation information.

4) We develop a method for fast evaluation of statis- 
tical significance of the interaction information (i.e. 
KWII). 

5) For both supervised and unsupervised cases, we 
propose the concepts of attribute combinations con- 
taining highly significant, moderately significant and 
non-significant correlation information. These are 
used to formulate combinations of interest as highly 
significant attribute sets that have all subsets with 
non-significant correlation information, and special 
combinations of interest that can have at most one 
subset with highly significant correlation information. 

6) We present bounds on correlation information (both 
TCI and CACI) and develop several pruning strate- 
gies utilizing these bounds to efficiently prune the 
search space. 

7) Using the bounds and pruning strategies, we develop
the algorithms correlation information miner (CIM) and
interaction information miner (IIM) for unsupervised
cases. We also develop the correlation information
miner class associated (CIMca) for supervised cases.

8) Using several experimental data sets and a real-life
data set, we critically examine the effectiveness and
efficiency of our proposed mining algorithms.



5 Association Information Metrics 

In this section, we introduce the basic notation used
throughout the paper. The words 'combination' and 'set' are
used interchangeably to refer to a collection of attributes.
A given data set D is represented as an $m \times n$ matrix of
discrete values where each row is a sample and each column
is an attribute. Let $\zeta = \{A_1; \ldots; A_n\}$ be the set of
attributes in D. We treat each $A_i$ as a discrete random
variable, and $p(a_i)$ denotes its probability mass function.

Definition 1: The uncertainty of a discrete random variable
$A_i$ is defined by Shannon's entropy [16] as,

$$H(A_i) = -\sum_{a_i \in A_i} p(a_i) \log p(a_i)$$
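As a minimal sketch of Definition 1, the entropy can be estimated from a column of discrete values by plugging in the empirical frequencies. The function name `entropy` and the base-2 logarithm are assumptions made for illustration; the definition above does not fix the base of the logarithm.

```python
from collections import Counter
from math import log2

def entropy(values):
    """Plug-in estimate of Shannon entropy (in bits) of a discrete sample."""
    n = len(values)
    counts = Counter(values)
    # Sum -p log p over the observed values, with p the empirical frequency.
    return -sum((c / n) * log2(c / n) for c in counts.values())

# A balanced binary attribute versus a constant attribute.
balanced = [0, 1, 0, 1]
constant = [1, 1, 1, 1]
print(entropy(balanced), entropy(constant))
```

A balanced binary attribute carries one bit of uncertainty, while a constant attribute carries none.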

Definition 2: The interaction information among the k
attributes (k-way interaction information or KWII) in the set
$S = \{A_1; \ldots; A_k\}$, $S \subseteq \zeta$, is the multivariate
generalization of Shannon's mutual information. It is defined
as the amount of information (synergy or redundancy) that
is present in the set of attributes and is not present in
any subset of these attributes [22]. The KWII can be
written succinctly as an alternating sum of the entropies
of all possible subsets $\tau$ of S using the difference operator
notation of Han [17]:

$$KWII(S) = -\sum_{\tau \subseteq S} (-1)^{|S \setminus \tau|} H(\tau)$$

The number of attributes k in a combination is called 
the order of the combination. KWII quantifies interactions 
by representing the information that cannot be obtained 
without observing all k attributes at the same time. 

In the bivariate case, the KWII is always nonnegative,
but in the multivariate case the KWII can be positive or
negative (positive values indicate synergy between the attributes,
negative values indicate redundancy between attributes, and
a value of zero indicates the absence of k-way interactions).
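The sign behaviour described above can be illustrated with a short sketch that evaluates the alternating sum of Definition 2 on two toy data sets. The helper names `joint_entropy` and `kwii`, the row-tuple data layout, and the plug-in entropy estimate in bits are assumptions for illustration only.

```python
from collections import Counter
from itertools import combinations
from math import log2

def joint_entropy(rows, cols):
    """Plug-in joint entropy (bits) of the attributes indexed by `cols`."""
    n = len(rows)
    counts = Counter(tuple(r[c] for c in cols) for r in rows)
    return -sum((v / n) * log2(v / n) for v in counts.values())

def kwii(rows, cols):
    """KWII(S) = -sum over subsets tau of S of (-1)^{|S \\ tau|} H(tau)."""
    k = len(cols)
    total = 0.0
    # The empty subset contributes H = 0, so sizes start at 1.
    for size in range(1, k + 1):
        for tau in combinations(cols, size):
            total += (-1) ** (k - size) * joint_entropy(rows, tau)
    return -total

# XOR data: the third attribute is the XOR of two independent bits.
xor_rows = [(a, b, a ^ b) for a in (0, 1) for b in (0, 1)]
# Copy data: three identical attributes.
copy_rows = [(a, a, a) for a in (0, 1)]

print(kwii(xor_rows, (0, 1, 2)))   # positive value: synergy
print(kwii(copy_rows, (0, 1, 2)))  # negative value: redundancy
print(kwii(xor_rows, (0, 1)))      # bivariate case: mutual information
```

In the XOR data no pair of attributes shares information, yet the three together do, which gives a positive three-way KWII; three copies of one attribute give a negative value.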

Definition 3: The Total Correlation Information (TCI)
involving the attributes in the set $S = \{A_1; \ldots; A_k\}$ is
defined [?], [?] as,

$$TCI(S) = \sum_{i=1}^{k} H(A_i) - H(A_1, \ldots, A_k) = \sum_{a_1} \cdots \sum_{a_k} p(a_1, \ldots, a_k) \log \frac{p(a_1, \ldots, a_k)}{p(a_1) \cdots p(a_k)}$$
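A minimal sketch of the entropy form of Definition 3 follows. The names `joint_entropy` and `tci` and the toy data are assumptions for illustration; entropies are plug-in estimates in bits.

```python
from collections import Counter
from math import log2

def joint_entropy(rows, cols):
    """Plug-in joint entropy (bits) of the attributes indexed by `cols`."""
    n = len(rows)
    counts = Counter(tuple(r[c] for c in cols) for r in rows)
    return -sum((v / n) * log2(v / n) for v in counts.values())

def tci(rows, cols):
    """TCI(S) = sum_i H(A_i) - H(A_1, ..., A_k)."""
    marginals = sum(joint_entropy(rows, (c,)) for c in cols)
    return marginals - joint_entropy(rows, cols)

# Columns 0 and 1 are independent bits; column 2 duplicates column 0.
rows = [(a, b, a) for a in (0, 1) for b in (0, 1)]

print(tci(rows, (0, 1)))     # independent pair
print(tci(rows, (0, 2)))     # duplicated pair
print(tci(rows, (0, 1, 2)))  # all three attributes
```

The independent pair shares no information, while the duplicated pair shares the full entropy of the attribute.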

The TCI is the total amount of information shared among
the attributes in the set. A TCI value of zero indicates
that the attributes are independent, and the maximal value
of TCI occurs when one attribute is completely redundant
with the others. An important property of the TCI is
that it is always non-negative and increases monotonically
with increasing combination size, i.e.,
$TCI(A_1; \ldots; A_k) \le TCI(A_1; \ldots; A_k; A_{k+1})$.
Next we examine the correlation
metrics in a supervised analysis, where a class label attribute
is present that specifies the label of each instance in the
data. First note that the TCI can be used to calculate
the correlation information by treating the class attribute
as just one of the attributes in a combination. However,
the correlation information represented by the TCI is not
free from confounding information that does
not involve the class attribute. For example, say we are
given data with three predictor attributes $A_1$, $A_2$, and $A_3$
and a class attribute C. The value of $TCI(A_1; A_2; A_3; C)$
represents the overall correlation information among
these attributes, which contains several components, viz.
$KWII(A_1; A_2)$, $KWII(A_1; A_3)$, $KWII(A_2; A_3)$ and
$KWII(A_1; A_2; A_3)$, that do not contain C or any
information related to C. We therefore present another metric
called the Class Associated Correlation Information (or
CACI), which is a non-overlapping sum of interaction
information about the class attribute for the predictor attributes
$A_1, \ldots, A_k$ and the class C. The CACI is obtained from
the measure representing the overall dependency among
the predictor attributes and the class attribute by removing
the contributions representing the interdependencies (e.g.,
correlations) among the predictor attributes not related to
the class attribute. Accordingly, the CACI is defined by:

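Definition 4: The Class Associated Correlation Information
(CACI) involving the attributes in the set $S = \{A_1; \ldots; A_k\}$
and the class C is defined as,

$$CACI(S; C) = TCI(S; C) - TCI(S) = \sum_{a_1} \cdots \sum_{a_k} \sum_{c} p(a_1, \ldots, a_k, c) \log \frac{p(a_1, \ldots, a_k, c)}{p(c)\, p(a_1, \ldots, a_k)} \qquad (1)$$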



In the above definition, the $TCI(A_1; A_2; \ldots; A_k; C)$ term
represents the overall dependency among all the
attributes and the class, whereas the $TCI(A_1; A_2; \ldots; A_k)$
term represents the inter-dependencies among the
predictor attributes alone, in the absence of the class attribute.
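The difference in Definition 4 can be sketched as follows. The function names, the column layout (the class stored as the last column), and the toy data are assumptions for illustration. In the toy data, two predictors are correlated with each other, and the class is the XOR of two predictors, so the TCI with the class mixes both kinds of dependency while the CACI keeps only the class-related part.

```python
from collections import Counter
from math import log2

def joint_entropy(rows, cols):
    """Plug-in joint entropy (bits) of the attributes indexed by `cols`."""
    n = len(rows)
    counts = Counter(tuple(r[c] for c in cols) for r in rows)
    return -sum((v / n) * log2(v / n) for v in counts.values())

def tci(rows, cols):
    """TCI(S) = sum_i H(A_i) - H(A_1, ..., A_k)."""
    return sum(joint_entropy(rows, (c,)) for c in cols) - joint_entropy(rows, cols)

def caci(rows, cols, class_col):
    """CACI(S; C) = TCI(S; C) - TCI(S)."""
    return tci(rows, tuple(cols) + (class_col,)) - tci(rows, tuple(cols))

# Predictors: column 0, column 1, column 2 (a copy of column 0).
# Class: column 3 = column 0 XOR column 1.
rows = [(a, b, a, a ^ b) for a in (0, 1) for b in (0, 1)]

print(tci(rows, (0, 1, 2, 3)))   # overall dependency, class included
print(tci(rows, (0, 1, 2)))      # dependency among predictors only
print(caci(rows, (0, 1, 2), 3))  # class-associated part
print(caci(rows, (0,), 3))       # a single predictor alone
```

On this data the CACI of a single predictor is zero even though the full predictor set determines the class, which is the kind of pattern a first order measure would miss.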



5.1 Properties of TCI 

Proposition 1: The TCI increases monotonically with 
increased combination size. 

Proof: For k attributes $A_1, A_2, \ldots, A_k$, we have:

$$\begin{aligned}
TCI(A_1; \ldots; A_k) - TCI(A_1; \ldots; A_{k-1}) &= \sum_{i=1}^{k} H(A_i) - H(A_1 \cdots A_k) - \sum_{i=1}^{k-1} H(A_i) + H(A_1 \cdots A_{k-1}) \\
&= H(A_k) - H(A_k \mid A_1 \cdots A_{k-1}) \ge 0 \qquad (2)
\end{aligned}$$

The last inequality follows from the fact that the entropy
of $A_k$ cannot increase when information from $A_1, \ldots, A_{k-1}$ is
known (the vertical bar denotes conditional entropy). □

Here, we state the theorems demonstrating the relationships
between the above mentioned two information theoretic
metrics [33].

Theorem 1: The TCI of an attribute set S represents the
sum of all KWII between two or more attributes from S,
i.e., $TCI(S) = \sum_{Z \subseteq S, |Z| \ge 2} KWII(Z)$.
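Theorem 1 can be checked numerically with a short sketch. The helper names, the synthetic data, and the random seed are assumptions for illustration; the check compares the TCI of a four-attribute set with the sum of the KWII values of all its subsets of size two or more.

```python
import random
from collections import Counter
from itertools import combinations
from math import log2

def joint_entropy(rows, cols):
    """Plug-in joint entropy (bits) of the attributes indexed by `cols`."""
    n = len(rows)
    counts = Counter(tuple(r[c] for c in cols) for r in rows)
    return -sum((v / n) * log2(v / n) for v in counts.values())

def tci(rows, cols):
    """TCI(S) = sum_i H(A_i) - H(A_1, ..., A_k)."""
    return sum(joint_entropy(rows, (c,)) for c in cols) - joint_entropy(rows, cols)

def kwii(rows, cols):
    """Alternating sum of subset entropies (Definition 2)."""
    k = len(cols)
    total = 0.0
    for size in range(1, k + 1):
        for tau in combinations(cols, size):
            total += (-1) ** (k - size) * joint_entropy(rows, tau)
    return -total

# Synthetic discrete data: 200 samples of four three-valued attributes.
rng = random.Random(0)
rows = [tuple(rng.randint(0, 2) for _ in range(4)) for _ in range(200)]

S = (0, 1, 2, 3)
kwii_sum = sum(kwii(rows, Z)
               for size in range(2, len(S) + 1)
               for Z in combinations(S, size))
print(tci(rows, S), kwii_sum)
```

The two printed values agree up to floating point error, because the identity holds for any joint distribution, including the empirical one.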



5.2 Properties of CACI 

Theorem 2: The CACI of an attribute set S and the class
C represents the sum of all KWII between one or
more attributes from S and C, i.e.,
$CACI(S; C) = \sum_{Z \subseteq S, |Z| \ge 1} KWII(Z; C)$.

Proof: For the set $S = \{A_1; \ldots; A_k\}$ and class C, from
Definition 3 we have,

$$\begin{aligned}
TCI(S; C) &= \sum_{i=1}^{k} H(A_i) + H(C) - H(S; C) \\
&= \sum_{i=1}^{k} H(A_i) - H(S) + H(C) + H(S) - H(S; C) \\
&= TCI(S) + TCI(A_1 \ldots A_k; C) \qquad (3)
\end{aligned}$$

Thus, using Theorem 1,

$$\begin{aligned}
TCI(A_1 A_2 \ldots A_k; C) &= TCI(S; C) - TCI(S) \\
&= \sum_{\nu \subseteq \{S; C\}, |\nu| \ge 2} KWII(\nu) - \sum_{\omega \subseteq S, |\omega| \ge 2} KWII(\omega) \\
&= \sum_{\xi \subseteq S, |\xi| \ge 1} KWII(\xi; C) \qquad (4)
\end{aligned}$$

The term $TCI(A_1 A_2 \ldots A_k; C)$ is the TCI between the
joint distribution of the k attributes and the class attribute;
the $TCI(S) = TCI(A_1; \ldots; A_k)$ term is the TCI among the
k attributes, and $TCI(S; C) = TCI(A_1; A_2; \ldots; A_k; C)$ is
the TCI among the k attributes and the class. The right-hand
side of equation (4) is the sum of all possible interactions among
the attributes $A_1, A_2, \ldots, A_k, C$ that contain the class
attribute C. This is defined as the Class Associated Correlation
Information or CACI. Thus,

$$CACI(S; C) = TCI(S; C) - TCI(S) = \sum_{\xi \subseteq S, |\xi| \ge 1} KWII(\xi; C) \qquad (5)$$

Because the information content of each KWII is non-redundant
(or non-overlapping) with every other combination
and the CACI can be expressed as a sum of KWII
values, the CACI is a non-overlapping sum of information
about the class attribute. □

Proposition 2: CACI is always greater than or
equal to zero and increases monotonically with
increased combination size (i.e.
$CACI(A_1; \ldots; A_k; C) \ge CACI(A_1; \ldots; A_{k-1}; C)$).
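Theorem 2 and Proposition 2 can be checked on synthetic data in the same way. This is only a numerical sketch: the helper names, the data generator, and the seed are assumptions, and the CACI is computed through the simplification $H(S) + H(C) - H(S; C)$ that follows from equation (3).

```python
import random
from collections import Counter
from itertools import combinations
from math import log2

def joint_entropy(rows, cols):
    """Plug-in joint entropy (bits) of the attributes indexed by `cols`."""
    n = len(rows)
    counts = Counter(tuple(r[c] for c in cols) for r in rows)
    return -sum((v / n) * log2(v / n) for v in counts.values())

def kwii(rows, cols):
    """Alternating sum of subset entropies (Definition 2)."""
    k = len(cols)
    total = 0.0
    for size in range(1, k + 1):
        for tau in combinations(cols, size):
            total += (-1) ** (k - size) * joint_entropy(rows, tau)
    return -total

def caci(rows, cols, class_col):
    """CACI(S; C) = TCI(S; C) - TCI(S) = H(S) + H(C) - H(S, C)."""
    cols = tuple(cols)
    return (joint_entropy(rows, cols) + joint_entropy(rows, (class_col,))
            - joint_entropy(rows, cols + (class_col,)))

# Three predictors (columns 0-2) and a noisy class label (column 3).
rng = random.Random(0)
rows = []
for _ in range(300):
    a, b, d = rng.randint(0, 1), rng.randint(0, 2), rng.randint(0, 1)
    c = (a ^ d) if rng.random() < 0.8 else rng.randint(0, 1)
    rows.append((a, b, d, c))

S, C = (0, 1, 2), 3
kwii_sum = sum(kwii(rows, Z + (C,))
               for size in range(1, len(S) + 1)
               for Z in combinations(S, size))
print(caci(rows, S, C), kwii_sum)           # Theorem 2: the two values agree
print(caci(rows, (0, 1), C), caci(rows, S, C))  # Proposition 2: non-decreasing
```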

6 Problem Formulation 

In this section, we develop a problem formulation
common to both supervised analysis (i.e. class attribute
present) and unsupervised analysis (i.e. class attribute
absent), which use the CACI and the TCI respectively. For ease
of presentation, let us denote either CACI or TCI by the
term CI (standing for correlation information). Wherever
applicable, we shall distinguish between the two by using
the actual names (CACI or TCI). First we introduce the
concepts of Combinations of Interest (or COI) and Special
Combinations of Interest (or SCOI). A COI is an attribute
set containing high CI such that its proper subsets have low
CI, while a SCOI is similar to a COI but has exactly
one proper subset with high CI. Our definitions of high
and low are based on statistical significance levels, which
are in turn based on the distributional properties explored in
section 7. Broadly, our goal is to mine the COI, SCOI and
combinations with high KWII that represent attribute sets
containing non-redundant association information either with
the class (for supervised studies) or without the class. To
develop our mining strategy, we first give some formal definitions.

6.1 Definitions for the Unsupervised Case 

First we present the definitions assuming no class attribute
is present. Our definitions use the common statistical
concept of the P-value. Given an observed value of a test
statistic, the P-value is the probability of obtaining a
value at least as extreme as the observed one under the null
distribution of the test statistic. Assume that we know the
probability distribution function of the TCI. Let $\alpha_{High}$ and
$\alpha_{Low}$ be two given significance levels for determining the
statistical significance of an observed value of TCI, such
that $0 < \alpha_{High} < \alpha_{Low}$. Let $S = \{A_1; \ldots; A_k\} \subseteq \zeta$ be a
given set of attributes.

Definition 5: S has statistically Highly Significant correlation
information if $Pvalue(TCI(S)) \le \alpha_{High}$. We
refer to such a combination of attributes as a Highly
Significant Combination or HSC.

Definition 6: S has statistically Non-Significant correlation
information if $Pvalue(TCI(S)) > \alpha_{Low}$. We refer
to such a combination of attributes as a Non-Significant
Combination or NSC.

Definition 7: S has statistically Moderately-Significant
correlation information if $\alpha_{High} < Pvalue(TCI(S)) \le \alpha_{Low}$.
Such a combination of attributes is called a
Moderately-Significant Combination or MSC.

For example, setting $\alpha_{High} = 10^{-10}$ and $\alpha_{Low} = 10^{-3}$, a
P-value of $10^{-12}$ is Highly Significant while a P-value of
0.01 is Non-Significant.
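The three-way labelling of Definitions 5 to 7 amounts to comparing a P-value against two levels, as in the sketch below. The function name `classify`, the default levels (taken from the example above), and the treatment of P-values that fall exactly on a level are assumptions for illustration.

```python
def classify(p_value, alpha_high=1e-10, alpha_low=1e-3):
    """Label a combination as HSC, MSC or NSC from the P-value of its TCI."""
    if not 0 < alpha_high < alpha_low:
        raise ValueError("expected 0 < alpha_high < alpha_low")
    if p_value <= alpha_high:
        return "HSC"   # highly significant
    if p_value > alpha_low:
        return "NSC"   # non-significant
    return "MSC"       # moderately significant, between the two levels

for p in (1e-12, 1e-5, 0.01):
    print(p, classify(p))
```

Note that a P-value of 0.01 is labelled non-significant here even though it would pass a conventional 0.05 test, because the lower level in the example is set to $10^{-3}$.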

Definition 8: S is a Combination Of Interest (or COI) 
if it satisfies:- 

1) S is a HSC, and 

2) Each proper subset of S is a NSC. 

However, checking all $2^k - 1$ proper subsets of S is
computationally expensive. Let $S_{k-1} \subset S$ be a subset with
$k-1$ attributes. From the monotonically increasing property of
the TCI (Proposition 1), $TCI(S) \ge TCI(S_{k-1})$. Therefore,
we make the assumption that if $Pvalue(TCI(S)) > \alpha_{Low}$,
then $Pvalue(TCI(S_{k-1}))$ is also $> \alpha_{Low}$, as a smaller TCI
value usually has lower significance. As a result, we only
need to check whether the subsets of S of size $k-1$ are NSC.

The definition of COI is based on the fact that if S is
a HSC and one or more of its subsets are HSC or MSC,
then S has redundancy, as it has at least one subset with
high correlation information. For example, assume the set
$S = \{A_1; A_2; A_3; A_4\}$ is a HSC and its subsets $S' = \{A_1; A_2\}$
and $S'' = \{A_3; A_4\}$ are also HSC. In this case, mining S'
and S'' is sufficient to capture all the interacting attributes.
However, this is a strict condition that needs to be relaxed
to capture more information, as seen in the next definition.
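The COI test with the shortcut of checking only the subsets of size $k-1$ can be sketched as below. The function `is_coi`, the callable `p_value`, and every P-value in the lookup table are hypothetical and chosen only to exercise the two outcomes; how the P-value of a TCI is actually obtained is outside this sketch.

```python
from itertools import combinations

def is_coi(attrs, p_value, alpha_high, alpha_low):
    """True if `attrs` is a HSC and all its (k-1)-subsets are NSC."""
    attrs = tuple(attrs)
    if p_value(frozenset(attrs)) > alpha_high:
        return False  # the set itself is not highly significant
    subsets = combinations(attrs, len(attrs) - 1)
    return all(p_value(frozenset(sub)) > alpha_low for sub in subsets)

# Hypothetical P-values of TCI for a few attribute sets.
P = {
    frozenset({"A1", "A2", "A3"}): 1e-12,
    frozenset({"A1", "A2"}): 0.20,
    frozenset({"A1", "A3"}): 0.50,
    frozenset({"A2", "A3"}): 0.04,
    frozenset({"A1", "A2", "A4"}): 1e-12,
    frozenset({"A1", "A4"}): 1e-11,
    frozenset({"A2", "A4"}): 0.30,
}
lookup = P.__getitem__

print(is_coi(("A1", "A2", "A3"), lookup, 1e-10, 1e-3))  # all pairs are NSC
print(is_coi(("A1", "A2", "A4"), lookup, 1e-10, 1e-3))  # pair (A1, A4) is a HSC
```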



Definition 9: Let $T_k$ denote the set of all subsets of S
with $k-1$ attributes. S is a Special Combination Of Interest
(or SCOI) if it satisfies:-

1) S is a HSC,

2) Exactly one member (say set X) $\in T_k$ is a HSC and
all others are NSC, and

3) $\Delta_{TCI} = TCI(S) - TCI(X)$ is statistically significant at
significance level $\alpha_{High}$.

Let $X = S \setminus \{A_k\}$. Expanding both TCI terms and cancelling
the individual entropies of the attributes in X gives
$\Delta_{TCI} = H(A_k) + H(X) - H(S) = TCI(X; A_k)$, where
X on the right-hand side represents a new attribute formed by
the joint of all attributes in X. The motivation behind the
definition of SCOI is based on the following example. Assume
the set $S = \{A_1; A_2; A_3; A_4\}$ is a HSC and only its subset
$S' = \{A_1; A_2; A_3\}$ is a HSC. If $\Delta_{TCI} = TCI(A_1 A_2 A_3; A_4)$ is
significant, $A_4$ is contributing significantly to the increased
correlation information. If we only mine S' and not S, we
lose important association information contributed by $A_4$
only in combination with S'.
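The three conditions of the SCOI definition can be sketched as a predicate. The function `is_scoi` and the callables `p_value` and `p_value_delta` are hypothetical, as are all P-values used in the example; the sketch only shows how the conditions combine.

```python
from itertools import combinations

def is_scoi(attrs, p_value, p_value_delta, alpha_high, alpha_low):
    """Check the three SCOI conditions for the attribute set `attrs`."""
    S = frozenset(attrs)
    if p_value(S) > alpha_high:
        return False                      # condition 1: S must be a HSC
    subsets = [frozenset(s) for s in combinations(sorted(S), len(S) - 1)]
    hsc = [s for s in subsets if p_value(s) <= alpha_high]
    nsc = [s for s in subsets if p_value(s) > alpha_low]
    if len(hsc) != 1 or len(nsc) != len(subsets) - 1:
        return False                      # condition 2: one HSC, rest NSC
    # Condition 3: the gain TCI(S) - TCI(X) must itself be significant.
    return p_value_delta(S, hsc[0]) <= alpha_high

# Hypothetical P-values: only {A1, A2, A3} among the 3-subsets is a HSC.
P = {
    frozenset({"A1", "A2", "A3", "A4"}): 1e-15,
    frozenset({"A1", "A2", "A3"}): 1e-12,
    frozenset({"A1", "A2", "A4"}): 0.30,
    frozenset({"A1", "A3", "A4"}): 0.10,
    frozenset({"A2", "A3", "A4"}): 0.60,
}
S = ("A1", "A2", "A3", "A4")
print(is_scoi(S, P.__getitem__, lambda s, x: 1e-11, 1e-10, 1e-3))
print(is_scoi(S, P.__getitem__, lambda s, x: 0.2, 1e-10, 1e-3))
```

In the second call the added attribute does not contribute a significant gain, so the set is not reported.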

6.2 Definitions for the Supervised Case 

Assume that we know the probability distribution function
of the CACI. Let $\alpha_{High}$ and $\alpha_{Low}$ be two given significance
levels for determining the statistical significance of an
observed value of CACI, such that $0 < \alpha_{High} < \alpha_{Low}$. Let
$S_C = S \cup \{C\}$ be a given set of attributes including
the class attribute.

Definition 10: $S_C$ has statistically Highly Significant
class associated correlation information if
$Pvalue(CACI(S_C)) \le \alpha_{High}$. We refer to such a combination
of attributes as a Highly Significant Combination Class
Associated or HSCca.

Definition 11: $S_C$ has statistically Non-Significant class
associated correlation information if
$Pvalue(CACI(S_C)) > \alpha_{Low}$. We refer to such a combination
of attributes as a Non-Significant Combination Class
Associated or NSCca.

Definition 12: $S_C$ has statistically Moderately-Significant
class associated correlation information if
$\alpha_{High} < Pvalue(CACI(S_C)) \le \alpha_{Low}$. Such a
combination of attributes is called a Moderately-Significant
Combination Class Associated or MSCca.

Again following the definitions we presented for the
unsupervised case, in the presence of C we have,

Definition 13: $S_C$ is a Combination Of Interest class
associated (or COIca) if it satisfies:-

1) $S_C$ is a HSCca, and

2) Each proper subset of $S_C$ is a NSCca.

However, checking all $2^k - 1$ proper subsets of $S_C$ is
computationally expensive. Following the same argument
as in the definition of COI, because the CACI also increases
monotonically, we only need to check whether the subsets of
$S_C$ of size $k-1$ are NSCca.

Finally, we define the case analogous to SCOI.

Definition 14: Let $T_k$ denote the set of all subsets of $S_C$
with $k-1$ attributes such that each subset contains C. $S_C$
is a Special Combination Of Interest class associated (or
SCOIca) if it satisfies:-

1) $S_C$ is a HSCca,

2) Exactly one member (say set $X_C$) $\in T_k$ is a HSCca
and all others are NSCca, and

3) $\Delta_{CACI} = CACI(S_C) - CACI(X_C)$ is statistically
significant at significance level $\alpha_{High}$.

The motivation behind the definition of SCOIca is
based on the following example. Assume the set
$S_C = \{A_1; A_2; A_3; A_4; C\}$ is a HSCca and only its subset
$S'_C = \{A_1; A_2; A_3; C\}$ is a HSCca. If $\Delta_{CACI}$ is significant,
$A_4$ is contributing significantly to the increased correlation
information with C. If we only mine $S'_C$ and not $S_C$,
we lose important class related association information
contributed by $A_4$ only in combination with $S'_C$.

6.3 Redundancy Considerations 

Next, we consider correlations among data attributes (e.g. 
linkage disequilibrium in genetic data) which can result 
in redundancy (i.e. presence of overlapping information) 
among the attribute combinations. First we present the case 
for unsupervised analysis. Using the property that KWII is 
negative in presence of redundancy, we have, 

Definition 15: Two attributes $A_i$ and $A_j$ are redundant if
$Red(A_i; A_j) = \frac{KWII(A_i; A_j; A_j)}{\min\{H(A_i), H(A_j)\}} \le -\lambda$,
where $0 \le \lambda \le 1$ is a user specified threshold.

The definition is based on the fact that if $A_i$ and $A_j$
have high redundancy, they are in fact interacting, i.e.,
$A_i$ explains $A_j$ very well. Also, $A_j$ completely explains
itself, causing the expression $KWII(A_i; A_j; A_j)$ to
contain redundant information. The denominator is used to
normalize the KWII and is based on the easy to prove fact
that $KWII(A_i; A_j; A_k) \le \min\{H(A_i), H(A_j), H(A_k)\}$.
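A sketch of the pairwise redundancy score follows, assuming the normalization by $\min\{H(A_i), H(A_j)\}$ described above. It uses the simplification $KWII(A_i; A_j; A_j) = -[H(A_i) + H(A_j) - H(A_i, A_j)]$, which follows from Definition 2 because $H(A_j, A_j) = H(A_j)$ and $H(A_i, A_j, A_j) = H(A_i, A_j)$. The function names, the zero-entropy guard, and the threshold value 0.9 are assumptions for illustration.

```python
from collections import Counter
from math import log2

def joint_entropy(rows, cols):
    """Plug-in joint entropy (bits) of the attributes indexed by `cols`."""
    n = len(rows)
    counts = Counter(tuple(r[c] for c in cols) for r in rows)
    return -sum((v / n) * log2(v / n) for v in counts.values())

def redundancy(rows, i, j):
    """Normalized KWII(A_i; A_j; A_j); values near -1 mean strong redundancy."""
    h_i = joint_entropy(rows, (i,))
    h_j = joint_entropy(rows, (j,))
    if min(h_i, h_j) == 0:
        return 0.0  # a constant attribute shares no information (assumed convention)
    mutual = h_i + h_j - joint_entropy(rows, (i, j))
    return -mutual / min(h_i, h_j)

def is_redundant(rows, i, j, lam=0.9):
    """Redundant if the score is at or below -lam (lam is user specified)."""
    return redundancy(rows, i, j) <= -lam

# Column 2 duplicates column 0; column 1 is independent of both.
rows = [(a, b, a) for a in (0, 1) for b in (0, 1)]
print(redundancy(rows, 0, 2), is_redundant(rows, 0, 2))
print(redundancy(rows, 0, 1), is_redundant(rows, 0, 1))
```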
In the presence of a class attribute C, we have,
Definition 16: Two attributes $A_i$ and $A_j$ are redundant
in the context of C if
$Red_{CA}(A_i; A_j) = \frac{KWII(A_i; A_j; C)}{\min\{H(A_i), H(A_j), H(C)\}} \le -\lambda_{CA}$,
where $0 < \lambda_{CA} < 1$ is a user specified threshold
in the presence of a class variable.

In the above definition, if the variables $A_i$ and $A_j$ are
redundant, they carry similar information about C; as a
result, $KWII(A_i; A_j; C)$ contains redundant
information, making it negative.

6.4 Mining Strategy 

Compared with the TCI or CACI, the KWII is a more
valuable information metric because it is a parsimonious
measure of association for the attribute combination of
interest alone and does not contain contributions from
lower-order combinations [22]. However, the KWII alone cannot
be used to devise an efficient mining algorithm because
it takes on both positive and negative values. In addition,
a TCI or CACI calculation needs only the individual entropies
and the joint entropies of the sets involved, whereas the KWII
needs the entropies of all subsets of the combination, making
the TCI and CACI computationally far more tractable
than the KWII. Both the TCI and CACI are always
non-negative and increase monotonically with increased
combination size, making them potentially suitable for our
mining algorithm. In the unsupervised case, from Theorem 1, the
TCI represents the cumulative synergy present in all subset
combinations of the attribute set $\{A_1; A_2; \ldots; A_k\}$. Our
goal is therefore to use the TCI in our mining algorithm to
identify the regions in the combinatorial space (the COI
and the SCOI) that contain potentially high correlation
information (and therefore high interaction information)
and then compute the KWII for the reduced combinatorial
space. As a result, we shall concomitantly mine attribute
sets containing useful correlation information (i.e. TCI) and
interaction information (i.e. KWII). Similarly, in the presence
of a class variable, we shall use the CACI to identify regions
in the combinatorial space containing high class associated
correlation and interaction information.

Given a maximum order of combinations to explore (K)
and a pair of significance levels ($\alpha_{High}$, $\alpha_{Low}$), our strategy
for mining combinations with significant TCI (or CACI) and
KWII broadly consists of two steps:

1) Mine all combinations that are COI and SCOI (or
COIca and SCOIca), and

2) If $\nu$ is the set of attributes present in the combinations
mined in step 1, compute $KWII(\tau)$ for all subsets
$\tau \subseteq \nu$ s.t. $|\tau| \le K$ (or, in the presence of a class
attribute C, if $\nu$ is the set of predictor attributes present
in the combinations mined in step 1, compute $KWII(\tau; C)$
for all subsets $\tau \subseteq \nu$ s.t. $|\tau| \le K$).

In step 1, we explore the search space in a breadth-first manner that
results in a set enumeration tree as shown in Figure 1. When mining
for COI and SCOI (or $COI_{CA}$ and $SCOI_{CA}$), computing the TCI
(or CACI) of every attribute set is time consuming. Therefore, in the
next section we develop upper and lower bounds on the TCI (or CACI) of
a node based on those of its parent/ancestor/sibling nodes in the
search space. We further develop pruning strategies using the
definitions of COI, SCOI (or $COI_{CA}$ and $SCOI_{CA}$) and
redundancy.

Fig. 1: Sample tree enumeration of BFS for Unsupervised Mining.



7 Correlation Information Bounds 

In this section, we present results on upper and lower bounds on the
TCI and CACI. The Pvalue computed on these bounds shall be used to
speed up our mining strategy.

7.1 Bounds on TCI

In obtaining the upper and lower bounds, we shall assume TCI
computations on the attribute set $S = \{A_1; \cdots; A_k\}$, a subset
of the full attribute set, unless otherwise stated.



Theorem 3:

$$TCI(S) \ge \sum_{i=1}^{k} H(A_i) - \frac{1}{2}\left[H(S \setminus \{A_1\}) + H(S \setminus \{A_2\}) + H(A_1;A_2)\right]$$

The above theorem computes a lower bound on $TCI(S)$ using entropies
from the ancestor nodes. We first use it recursively to compute an
upper bound on $H(S)$ in a greedy fashion: first obtain its
two-attribute subset (say $(A_i; A_j)$) with maximum pair-wise entropy
and then recursively compute upper bounds on the entropies
$H(S \setminus \{A_i\})$ and $H(S \setminus \{A_j\})$. The upper bound
on $H(S)$ is then used to compute the lower bound on $TCI(S)$.

Theorem 4:

$$TCI(S) \le TCI(S \setminus \{A_t\}) + \min\{H(S \setminus \{A_t\}), H(A_t)\}$$

The theorem computes an upper bound on $TCI(S)$ using the TCI and
entropy of its parent node $\{S \setminus \{A_t\}\}$ and $H(A_t)$.
The next two theorems are used to compute the upper and lower bounds
of the node $\{S;A_j\}$ using the entropy of its sibling $\{S;A_i\}$,
the entropies of individual attributes and conditional entropies. Note
that each conditional entropy of the form $H(A_i|A_j)$ is given by
$H(A_i;A_j) - H(A_j)$.
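
As an illustration of how these quantities can be evaluated from data, the following sketch estimates entropies from empirical frequencies, forms the TCI as the sum of the individual entropies minus the joint entropy, and evaluates the upper bound of Theorem 4. The helper names (`entropy`, `tci`, `tci_upper_bound`) and the toy XOR data are assumptions made for this example and are not taken from the paper.

```python
from collections import Counter
from math import log2


def entropy(rows, cols):
    """Empirical joint entropy (in bits) of the columns `cols` of `rows`."""
    counts = Counter(tuple(r[c] for c in cols) for r in rows)
    n = len(rows)
    return -sum((k / n) * log2(k / n) for k in counts.values())


def tci(rows, cols):
    """TCI(S): sum of individual entropies minus the joint entropy."""
    return sum(entropy(rows, [c]) for c in cols) - entropy(rows, cols)


def tci_upper_bound(rows, cols, t):
    """Theorem 4: TCI(S) <= TCI(S minus A_t) + min{H(S minus A_t), H(A_t)}."""
    rest = [c for c in cols if c != t]
    return tci(rows, rest) + min(entropy(rows, rest), entropy(rows, [t]))


# Hypothetical toy data: column 2 is the XOR of columns 0 and 1.
rows = [(a, b, a ^ b) for a in (0, 1) for b in (0, 1)] * 25
S = [0, 1, 2]
true_tci = tci(rows, S)
bound = tci_upper_bound(rows, S, 2)
```

In this toy case the parent node {0, 1} has zero TCI, so the bound is driven entirely by the entropy term.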
Theorem 5:

$$TCI(S;A_j) \ge \sum_{t=1}^{k} H(A_t) + H(A_j) - H(S;A_i) - \min_{t}\{H(A_j|A_t)\}$$

Theorem 6:

$$TCI(S;A_j) \le \sum_{t=1}^{k} H(A_t) + H(A_j) - H(S;A_i) + \Delta$$

where $\Delta = \min\{H(A_i|A_j), \min_{t=1}^{k}\{H(A_j|A_t)\}\}$.

7.2 Bounds on CACI

In obtaining the upper and lower bounds, we shall assume CACI
computations on the attribute set $S = \{A_1; \cdots; A_k\}$, a subset
of the full attribute set, and class variable $C$ unless otherwise
stated. A lower bound on CACI is given by the following theorem.

Theorem 7:

$$CACI(S;C) \ge H(C) - \min_{i=1}^{k} H(C|A_i) \qquad (6)$$

Proof: We have $CACI(S;C) = H(A_1 \ldots A_k) + H(C) - H(A_1 \ldots A_k C) = H(C) - H(C|A_1 \ldots A_k)$.
Now the result follows from the fact that
$H(C|A_1 \ldots A_k) \le H(C|A_i)$ $\forall i = 1 \ldots k$. □

The following theorem gives us an upper bound on CACI.

Theorem 8:

$$CACI(S;C) \le \min\left\{\frac{1}{2}\left[H(S \setminus \{A_1\}) + H(S \setminus \{A_2\}) + H(A_1;A_2)\right], H(C)\right\}$$

Proof: We have $CACI(S;C) = H(C) - H(C|S)$ so that
$CACI(S;C) \le H(C)$. Again, $CACI(S;C) = H(S) - H(S|C)$ so that
$CACI(S;C) \le H(S)$. Thus clearly, $CACI(S;C) \le \min\{H(C), H(S)\}$.
But $H(S) \le \frac{1}{2}\left[H(S \setminus \{A_1\}) + H(S \setminus \{A_2\}) + H(A_1;A_2)\right]$
(Theorem 6.1, eq. 6.3 in [33]); the result follows from that. □

8 Statistical Significance of Correlation and Interaction Information

8.1 Probability Distribution of TCI

In this section, we state results on the probability distribution of
the TCI using a Taylor series based approximation to the TCI [33].
This shall be used to evaluate the significance of the correlation
information of an attribute set.

Theorem 9: The distribution of $TCI(A_1; \cdots; A_k)$ can be
approximated by a gamma distribution with scale parameter
= $1/(N\ln(2))$ and shape parameter = $df_{TCI}/2$.

Using Theorem 9, the Pvalue of an observed TCI value $t$ is given by
$Prob(TCI > t)$.

Next we derive the probability distribution of the CACI random
variable.

8.2 Probability Distribution of CACI

We derive the probability distribution of the CACI. The proof is very
similar to the one for TCI.

Theorem 10: Let $S = \{A_1; \cdots; A_k\}$ denote a set of variables
and $C$ be a class variable. Let $A$ represent a new variable formed
by the joint of all attributes in $S$. Then the CACI can be
approximated as

$$CACI(S;C) \approx \frac{1}{2\ln(2)} \sum_{a,c} \frac{(p(a,c) - p(a)p(c))^2}{p(a)p(c)}$$

Proof: Let $p(a,c) = \psi_1$ and $p(a)p(c) = \psi_2$. Let
$f(\psi_1) = p(a,c)\log_2\left(\frac{p(a,c)}{p(a)p(c)}\right) = \psi_1\log_2\left(\frac{\psi_1}{\psi_2}\right)$.
Using Taylor's expansion of $f(\psi_1)$ about $\psi_1 = \psi_2$, we
have

$$f(\psi_1) = f(\psi_2) + f'(\psi_2)(\psi_1 - \psi_2) + f''(\psi_2)\frac{(\psi_1 - \psi_2)^2}{2} + \cdots$$

Here $f(\psi_2) = 0$, $f'(\psi_2) = \frac{1}{\ln(2)}$ and
$f''(\psi_2) = \frac{1}{\psi_2\ln(2)}$. Therefore,

$$f(\psi_1) = \frac{\psi_1 - \psi_2}{\ln(2)} + \frac{(\psi_1 - \psi_2)^2}{2\ln(2)\,\psi_2} + \cdots$$

Ignoring higher order terms in the Taylor's expansion,

$$CACI(S;C) = \sum_{a,c} f(\psi_1) \approx \sum_{a,c}\frac{\psi_1}{\ln(2)} - \sum_{a,c}\frac{\psi_2}{\ln(2)} + \sum_{a,c}\frac{(\psi_1 - \psi_2)^2}{2\ln(2)\,\psi_2}$$

The first two summations each sum to $1/\ln(2)$ and cancel, resulting
in Theorem 10. □
Again, the expression of CACI is related to the two-dimensional
statistical $\chi^2$ test, defined as

$$\chi^2 = \sum_{i_1,i_2} \frac{(O_{i_1,i_2} - E_{i_1,i_2})^2}{E_{i_1,i_2}} \qquad (7)$$

where the summation is over the cells of the 2-dimensional contingency
table, $O_{i_1,i_2}$ denotes the observed cell count and
$E_{i_1,i_2}$ denotes the expected cell count for cell $(i_1,i_2)$.

The degrees of freedom present in a 2-dimensional contingency table is
$df_{CACI} = R_1R_2 - R_1 - R_2 + 1$ [35]. Here $R_1$ denotes the
count of distinct values that variable $A$ can take, while $R_2$
denotes the count of distinct values that variable $C$ can take.
Equating the observed and expected cell counts to the relative
frequencies and the cell probabilities, it can be easily observed
that

$$\chi^2 = 2N\ln(2)\,\widetilde{CACI}(A_1; \cdots; A_k; C) \qquad (8)$$

where $N$ denotes the total number of samples in the data (i.e. the
sum of cell counts in all cells of the 2-dimensional contingency
table) and $\widetilde{CACI}$ represents the approximation to the CACI
metric. Using Theorem 9 and equation 8, it can be easily proved that,

Theorem 11: The distribution of $CACI(A_1; \cdots; A_k; C)$ can be
approximated by a gamma distribution with scale parameter
= $1/(N\ln(2))$ and shape parameter = $df_{CACI}/2$.

Using Theorem 11, the Pvalue of an observed CACI value $t$ is
calculated as $Prob(CACI > t)$.
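
To make the Pvalue computation concrete, the sketch below uses the fact that a gamma variable with scale $1/(N\ln(2))$ and shape $df/2$, multiplied by $2N\ln(2)$, follows a $\chi^2$ distribution with $df$ degrees of freedom (the scaling in equation 8). The function names `chi2_sf` and `correlation_pvalue` are hypothetical, the closed-form tail formulas assume an integer `df`, and the degrees of freedom are passed in by the caller rather than derived here.

```python
from math import erfc, exp, log, pi, sqrt


def chi2_sf(x, df):
    """Upper tail P(X > x) of a chi-square variable with integer df >= 1."""
    if x <= 0:
        return 1.0
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum_{i < df/2} (x/2)^i / i!
        term, total = 1.0, 1.0
        for i in range(1, df // 2):
            term *= (x / 2) / i
            total += term
        return exp(-x / 2) * total
    # Odd df: erfc(sqrt(x/2)) plus a finite sum of half-integer powers of x.
    term, total = sqrt(x), 0.0
    for r in range(1, (df - 1) // 2 + 1):
        if r > 1:
            term *= x / (2 * r - 1)
        total += term
    return erfc(sqrt(x / 2)) + sqrt(2 / pi) * exp(-x / 2) * total


def correlation_pvalue(value_bits, n_samples, df):
    """Pvalue of an observed TCI or CACI (in bits) under the gamma approximation."""
    return chi2_sf(2.0 * n_samples * log(2) * value_bits, df)


# Hypothetical numbers: 0.05 bits observed on N = 200 samples with df = 4.
p = correlation_pvalue(0.05, 200, 4)
```

For real use, a library routine for the gamma or chi-square survival function would normally replace `chi2_sf`; it is written out here only to keep the example self-contained.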

8.3 Probability Distribution of $\Delta_{CACI}$

Theorem 12: Let $S_C = \{A_1; A_2; \cdots; A_k; C\}$ and let
$X_C = S_C \setminus \{A_k\} = \{A_1; A_2; \cdots; A_{k-1}; C\}$. Then
$\Delta_{CACI} = CACI(S_C) - CACI(X_C)$. Let $|A_k|$ represent the
number of states of the attribute $A_k$. Let $\bar{A}$ represent a new
variable formed by the joint of all attributes in
$\{A_1; A_2; \cdots; A_{k-1}\}$ and let $\bar{a}$ represent its
realizations. The distribution of $\Delta_{CACI}$ can be approximated
by a gamma distribution with scale parameter = $1/(N\ln(2))$ and shape
parameter = $|\bar{A}|\,df_{CACI}/2$.

Proof: First note that $\Delta_{CACI}$ can be written as,

$$\Delta_{CACI} = \sum_{a_k,\bar{a},c} p(a_k,\bar{a},c)\log\left(\frac{p(a_k,\bar{a},c)\,p(\bar{a})}{p(c,\bar{a})\,p(a_k,\bar{a})}\right)$$

$$= \sum_{a_k,\bar{a},c} p(a_k,\bar{a},c)\log\left(\frac{p(a_k,c|\bar{a})}{p(c|\bar{a})\,p(a_k|\bar{a})}\right)$$

$$= \sum_{\bar{a}} p(\bar{a}) \sum_{a_k,c} p(a_k,c|\bar{a})\log\left(\frac{p(a_k,c|\bar{a})}{p(c|\bar{a})\,p(a_k|\bar{a})}\right)$$

$$= \sum_{\bar{a}} p(\bar{a})\,\Phi_{\bar{a}}(A_k;C) \quad \text{(Let)} \qquad (9)$$

Assuming the random variables $A_k$ and $C$ are independent given
$\bar{A}$, the expression $\Phi_{\bar{a}}(A_k;C) = 0$ because
$p(a_k,c|\bar{A} = \bar{a}) = p(a_k)p(c)$,
$p(a_k|\bar{A} = \bar{a}) = p(a_k)$ and
$p(c|\bar{A} = \bar{a}) = p(c)$. Therefore we can assume
$\Phi_{\bar{a}}(A_k;C)$ to be an independent random variable given
each value $\bar{a}$ of the random variable $\bar{A}$. Now note that
given a specific value $\bar{a}$ of the random variable $\bar{A}$, we
have,

$$\Phi_{\bar{a}}(A_k;C) = \sum_{a_k,c} p_{\bar{a}}(a_k,c)\log\left(\frac{p_{\bar{a}}(a_k,c)}{p_{\bar{a}}(c)\,p_{\bar{a}}(a_k)}\right) = CACI_{\bar{a}}(A_k;C) \qquad (10)$$

In the above equation, $p_{\bar{a}}$ represents the probabilities
calculated using only the data samples with $\bar{A} = \bar{a}$, and
$CACI_{\bar{a}}$ represents the corresponding CACI. Therefore
$\Phi_{\bar{a}}(A_k;C)$ is gamma distributed with scale
$1/(N_{\bar{a}}\ln(2))$, shape $df_{CACI}/2$ and moment generating
function
$M_{\bar{a}}(t) = \left(1 - \frac{t}{N_{\bar{a}}\ln(2)}\right)^{-df_{CACI}/2}$,
where $N_{\bar{a}}$ represents the number of data samples with
$\bar{A} = \bar{a}$. As a result, $\Delta_{CACI}$ can be considered as
a weighted sum of independent gamma random variates. Therefore the
moment generating function of $\Delta_{CACI}$ is given by

$$M_{\Delta_{CACI}}(t) = \prod_{\bar{a}} M_{\bar{a}}(p(\bar{A} = \bar{a})\,t) \qquad (11)$$

But $p(\bar{A} = \bar{a}) = N_{\bar{a}}/N$ so that
$p(\bar{A} = \bar{a})/N_{\bar{a}} = 1/N$. Therefore,

$$M_{\Delta_{CACI}}(t) = \prod_{\bar{a}} \left(1 - \frac{t}{N\ln(2)}\right)^{-df_{CACI}/2} = \left(1 - \frac{t}{N\ln(2)}\right)^{-|\bar{A}|\,df_{CACI}/2} \qquad (12)$$

which is the moment generating function of the gamma distribution with
scale parameter = $1/(N\ln(2))$ and shape parameter
= $|\bar{A}|\,df_{CACI}/2$. □

Using Theorem 11, the Pvalue of an observed CACI value $t$ is
calculated as $Prob(CACI > t)$.

8.4 Significance of Interaction Information 

Determining a closed-form expression for the distribution of the KWII
is difficult because, unlike the TCI and CACI, the KWII involves
alternating sums of the entropies of all possible subsets. We
therefore resort to a permutation strategy to calculate the
significance (i.e. the Pvalue) of the KWII of a set of attributes. The
strategy is slightly different for unsupervised and supervised
analysis. First consider the case of unsupervised analysis. Assume
that we want to calculate the significance of
$t = KWII(A_1; A_2; \cdots; A_k) = KWII(\omega)$. Let $X$ be the
attribute from the set $\{A_1; A_2; \cdots; A_k\}$ with the minimum
number of states. For supervised analysis, to calculate the
significance of $t = KWII(A_1; A_2; \cdots; A_k; C) = KWII(\omega)$
($C$ being the class attribute), let $X = C$. Our permutation
procedure shuffles the states of the attribute $X$. The following
algorithm then calculates the Pvalue of the value $t$:

PERMUTATION($\omega$, $t$)

1. $KWII_{actual} \leftarrow t$;

2. Generate NPERM permutations of the data by randomly shuffling the
states of the attribute $X$;

3. Calculate the permuted $KWII^{p}(\omega)$ for each permuted data
set;

4. Pvalue $\leftarrow$ fraction of all $KWII^{p}(\omega) \ge KWII_{actual}$;

5. return Pvalue;
end
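
The naive form of this procedure can be sketched as follows. The sketch assumes the alternating-sum form $KWII(S) = -\sum_{T \subseteq S} (-1)^{|S|-|T|} H(T)$, which matches the statement in Section 9.4.2 that the IIM array ends up holding the negative of the KWII; the function names, the default of 200 permutations, the fixed seed and the XOR toy data are all assumptions made for illustration.

```python
import random
from collections import Counter
from itertools import combinations
from math import log2


def entropy(rows, cols):
    """Empirical joint entropy (in bits) of the columns `cols` of `rows`."""
    counts = Counter(tuple(r[c] for c in cols) for r in rows)
    n = len(rows)
    return -sum((k / n) * log2(k / n) for k in counts.values())


def kwii(rows, cols):
    """Assumed definition: minus the alternating sum of subset entropies."""
    total = 0.0
    for size in range(1, len(cols) + 1):
        sign = (-1) ** (len(cols) - size)
        for sub in combinations(cols, size):
            total += sign * entropy(rows, sub)
    return -total


def permutation_pvalue(rows, cols, x, nperm=200, seed=0):
    """Fraction of permuted KWII values >= the observed one (column x is shuffled)."""
    rng = random.Random(seed)
    actual = kwii(rows, cols)
    x_values = [r[x] for r in rows]
    hits = 0
    for _ in range(nperm):
        rng.shuffle(x_values)
        permuted = [r[:x] + (v,) + r[x + 1:] for r, v in zip(rows, x_values)]
        if kwii(permuted, cols) >= actual:
            hits += 1
    return hits / nperm


# Hypothetical toy data: column 2 is the XOR of columns 0 and 1.
rows = [(a, b, a ^ b) for a in (0, 1) for b in (0, 1)] * 25
observed = kwii(rows, [0, 1, 2])
pvalue = permutation_pvalue(rows, [0, 1, 2], x=2)
```

Each permutation here rescans all samples, which is exactly the cost that the contingency-table formulation described next is meant to avoid.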

Fast Permutations: The permutation procedure described in the
algorithm, if implemented naively, can be very time consuming because,
for a given combination, the KWII needs to be computed across all the
data samples repeatedly for each shuffle of the states of the
attribute $X$. However, for discrete data, we can implement the
permutation in a faster manner. The key observation is that for a
given combination of attributes (and possibly class $C$), the
sufficient statistics for computing the KWII (the empirical counts for
each state for different subsets of the attributes) are present in the
corresponding contingency table, in which the rows represent the
states of the attributes (except $X$) while the columns represent the
states of $X$. As a result, a shuffle of the states of $X$ corresponds
to a change in the counts in the cells of the contingency table such
that the row sums and column sums are unchanged. Note that we only
need to scan the data once to build the contingency table for each
combination, which is required anyway for computing the original KWII.
Once we have the contingency table for a particular combination, we
can shuffle the counts in the contingency table in the above manner
and compute the KWII for each shuffled table to obtain the permuted
KWII values. Assume a combination $C$ with $k$ variables has $b$
states; then the contingency table $T$ will have $b$ cells. Creating
$T$ has $O(m \times b)$ complexity, where $m$ is the sample size of
the data. Then $KWII(C)$ requires
$O(m \times b + 2^k \times b) = O(m \times b)$ computations (for
$m \gg 2^k$), as the entropies of all subsets of the $k$ variables are
computable by marginalizing $T$. Thus the first KWII computation
involves $O(m \times b)$ computations because $T$ is constructed. For
each permutation, we shuffle the counts in $T$ using an efficient
algorithm presented in [36], which consumes approximately $O(b)$
computations, so that for NPERM permutations the time complexity is
only NPERM $\times$ $O(b)$. Also, the KWII constitutes the output from
the IIM algorithm and we anticipate very few interactions to be
present in the data, so permutations need to be performed on only a
few attribute combinations.
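
The equivalence between shuffling $X$ and redrawing the table with fixed margins can be illustrated with the short sketch below. It is deliberately naive: it expands the table back into one label pair per sample and therefore costs $O(m)$ per shuffle, so it is a stand-in for, not an implementation of, the $O(b)$ algorithm of [36]. The function name `shuffle_table` and the example counts are assumptions.

```python
import random


def shuffle_table(table, rng):
    """Return a random table with the same row and column sums as `table`."""
    row_labels, col_labels = [], []
    for i, row in enumerate(table):
        for j, count in enumerate(row):
            row_labels.extend([i] * count)
            col_labels.extend([j] * count)
    # Shuffling the states of X only re-pairs column labels with rows.
    rng.shuffle(col_labels)
    shuffled = [[0] * len(table[0]) for _ in table]
    for i, j in zip(row_labels, col_labels):
        shuffled[i][j] += 1
    return shuffled


# Hypothetical 3 x 2 table: rows are joint states of the other attributes,
# columns are the states of X.
table = [[10, 0], [0, 10], [5, 5]]
permuted = shuffle_table(table, random.Random(1))
```

Because every sample keeps its row and every column label is reused exactly once, both sets of margins are preserved by construction.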

9 Algorithm 

In this section we describe our mining algorithms in detail. The
algorithms are developed for unsupervised analysis. The same
algorithms can be applied, with some modifications, to supervised
analysis; the modifications for class associated analysis are
described in context. Our mining algorithm consists of two stages:
(1) the Correlation Information Miner (or CIM), followed by (2) the
Interaction Information Miner (or IIM). The CIM explores the
combinatorial space of attribute sets using a breadth-first search
(BFS), enumerating a BFS tree where each node represents an attribute
set $\{A_i; A_j; \cdots; A_k\}$ ($i < j < \cdots < k$) (or
$\{A_i; A_j; \cdots; A_k; C\}$ ($i < j < \cdots < k$) when $C$ is
present). Next we describe pruning strategies using the concept of
redundancy and the bounds on TCI (or CACI) introduced before.

9.1 Redundancy based pruning 

This pruning strategy is applied to the given data set $D$ before
starting the BFS strategy, using the redundancy definition [9]. The
goal is to remove redundant attributes, thereby reducing the size of
the combinatorial space of attribute associations. It consists of two
steps. (I) For each attribute $A_i$ in the full attribute set, compute
$Red(A_i;A_j)$ with every other attribute $A_j$. If
$Red(A_i;A_j) < -\lambda$, store $A_j$ in a list associated with
$A_i$. This step creates a list of attributes redundant with each
$A_i$, denoted $Cover(A_i)$ (which includes $A_i$ itself). An
attribute $A_j \in Cover(A_i)$ is said to be covered by $A_i$. E.g. if
$A_1$ is redundant with $A_2$, $A_5$ and $A_8$, then
$Cover(A_1) = \{A_1; A_2; A_5; A_8\}$. (II) Create a smaller data set
$D'$ by greedily selecting the attribute $A_i$ with the highest
cardinality $|Cover(A_i)|$ (i.e. covering the maximum number of other
attributes) until all attributes are covered. This smaller data set is
used as input for the algorithm described below. The computation of
$Red(A_i;A_j)$ uses either definition [13] or [16], depending on
whether $C$ is absent or present in the analysis.
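
Step (II) can be sketched as a greedy cover selection. The sketch takes the cover lists from step (I) as given and, as an assumption of this example, ranks attributes by how many still-uncovered attributes they cover and breaks ties by attribute name; the paper only states that the attribute with the largest $|Cover(A_i)|$ is selected greedily. The function name and the cover lists are hypothetical.

```python
def greedy_cover(cover):
    """Select attributes until every attribute is covered.

    `cover` maps each attribute to the set of attributes it covers
    (itself included), as produced by step (I).
    """
    uncovered = set(cover)
    selected = []
    while uncovered:
        # Pick the attribute covering the most still-uncovered attributes.
        best = max(sorted(cover), key=lambda a: len(cover[a] & uncovered))
        selected.append(best)
        uncovered -= cover[best]
    return selected


# Hypothetical cover lists, following the A1 example in the text.
cover = {
    "A1": {"A1", "A2", "A5", "A8"},
    "A2": {"A1", "A2"},
    "A3": {"A3"},
    "A5": {"A1", "A5"},
    "A8": {"A1", "A8"},
}
kept = greedy_cover(cover)
```

The loop always terminates because every attribute covers at least itself.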

9.2 Sample Size based pruning 

Given an attribute set $S = \{A_1; \cdots; A_k\}$ and sample size $N$,
$TCI(S)$ and $KWII(S)$ are based on empirically estimated probability
distributions of the attributes and their combinations from set $S$.
Let the cardinality of the set of attribute values of $S$ be $V$. The
calculated TCI and KWII are often poor estimates when $N/V < 5$ [37].
Therefore, we prune node $S$ when $N/V < 5$ to reduce the chances of
discovering false positive associations. For example, to evaluate the
TCI of $\{A_1; A_2; A_3\}$ where each attribute takes 3 values, there
should be at least $3^3 \times 5 = 135$ instances. Similarly, for
supervised analysis with $S = \{A_1; \cdots; A_k; C\}$, we prune node
$S$ when $N/V < 5$, where $V$ is the cardinality of the values of the
attributes in set $S$.
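
A minimal sketch of this check is given below, assuming that $V$ is the product of the numbers of states of the attributes in $S$ (which reproduces the $3^3$ in the example). The function name `enough_samples` and its parameters are hypothetical.

```python
from math import prod


def enough_samples(n_samples, state_counts, min_per_cell=5):
    """True when N / V >= min_per_cell, with V the product of the state counts."""
    return n_samples / prod(state_counts) >= min_per_cell


# Three attributes with 3 states each: V = 27, so 135 samples are needed.
keep_135 = enough_samples(135, [3, 3, 3])
keep_134 = enough_samples(134, [3, 3, 3])
```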

9.3 Bound based pruning 

First we describe the pruning for unsupervised analysis that 
uses the TCI. 

9.3.1 TCI Bound based pruning

For each node $S$ in the search space, we calculate its upper and
lower bounds before computing the actual $TCI(S)$. Let $L(S)$ be the
maximum of the lower bounds, $U(S)$ be the minimum of the upper bounds
and $T(S)$ be the true TCI for node $S$. Let $P(v)$ be the Pvalue for
any value $v$. Note that as $L(S) \le T(S) \le U(S)$, we have
$P(L(S)) \ge P(T(S)) \ge P(U(S))$. Assume that we have determined
whether $S$ is a HSC/MSC/NSC. We employ the procedures Handle HSC and
Handle MSC/NSC described below to handle each case. In the following,
in each iteration, NextLevel is a queue that collects the nodes to be
explored in the next iteration of BFS and $\theta$ is the set of
COI/SCOI output by CIM.

(1) Handle HSC: Assume that the parent of the given node $S$ is not a
COI/SCOI. Using property 2 in definition 7, if $S$ is a COI, store $S$
in $\theta$ and add node $S$ to NextLevel; otherwise, prune the
subtree rooted at $S$, as at least one subset of $S$ has redundant
correlation information. If the parent of $S$ is a COI/SCOI, using
properties 2 and 3 in definition 8, if $S$ is a SCOI, store $S$ in
$\theta$ and add node $S$ to NextLevel; otherwise, prune the subtree
rooted at $S$, as the new attribute present in $S$ (and not in its
parent) does not significantly increase the correlation information.

(2) Handle MSC/NSC: If node $S$ is a MSC, $S$ and any superset of it
cannot be a COI/SCOI, so simply prune the subtree rooted at $S$. If it
is a NSC, add $S$ to NextLevel to continue the search process.

Based on the TCI bounds, we have the following cases:

1) $P(U(S)) \le P(L(S)) < \alpha_{High}$: $S$ is a HSC. Use Handle HSC
to handle it.

2) $P(U(S)) < \alpha_{High} < P(L(S)) < \alpha_{Low}$;

3) $P(U(S)) < \alpha_{High}$, $\alpha_{Low} < P(L(S))$;

4) $\alpha_{High} < P(U(S)) < \alpha_{Low} < P(L(S))$:
For cases 2, 3 and 4, compute the TCI $T(S)$. If
$Pvalue(T(S)) < \alpha_{High}$, $S$ is a HSC; use Handle HSC to handle
it. Otherwise use Handle MSC/NSC.

5) $\alpha_{Low} < P(U(S)) \le P(L(S))$: $S$ is a NSC; use Handle
MSC/NSC.

6) $\alpha_{High} < P(U(S)) \le P(L(S)) < \alpha_{Low}$: $S$ is a MSC;
use Handle MSC/NSC.

Note that actual TCI computations are required only in cases 2, 3 and
4, thereby improving computational efficiency.
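
The six cases can be collapsed into a small decision function, sketched below. The function name, the returned labels and the handling of exact equality with a significance level (which the strict inequalities above leave open) are assumptions of this example.

```python
def classify_by_bounds(p_upper, p_lower, alpha_high, alpha_low):
    """Decide from the bound Pvalues alone, where possible.

    p_upper = P(U(S)) and p_lower = P(L(S)), with p_upper <= p_lower expected.
    Returns "HSC", "MSC", "NSC", or "COMPUTE" when the true TCI is needed.
    """
    if p_lower < alpha_high:
        return "HSC"       # case 1: even the lower bound is highly significant
    if p_upper >= alpha_low:
        return "NSC"       # case 5: even the upper bound is non-significant
    if p_upper >= alpha_high and p_lower < alpha_low:
        return "MSC"       # case 6: both bounds fall between the two levels
    return "COMPUTE"       # cases 2, 3 and 4: the bounds straddle a level


# Significance levels as used in the experiments of Section 10.
labels = [
    classify_by_bounds(1e-12, 1e-10, 1e-8, 1e-2),
    classify_by_bounds(1e-10, 1e-5, 1e-8, 1e-2),
    classify_by_bounds(1e-10, 0.5, 1e-8, 1e-2),
    classify_by_bounds(1e-4, 0.5, 1e-8, 1e-2),
    classify_by_bounds(0.2, 0.5, 1e-8, 1e-2),
    classify_by_bounds(1e-5, 1e-3, 1e-8, 1e-2),
]
```

The six calls correspond to cases 1 to 6 in order.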
Next we describe the CIM algorithm.

9.3.2 CACI bound based pruning

The pruning strategy for supervised analysis is very similar to the
above case. Therefore, we do not describe it in detail; rather, we
highlight the following modifications to the above strategy that are
needed when $C$ is present in the analysis.

1) Each node $S$ in the search space now represents the set
$\{A_1; \cdots; A_k; C\}$, and we calculate the upper and lower bounds
before the actual $CACI(S)$.

2) We now have procedures Handle $HSC_{CA}$ and Handle
$MSC_{CA}$/$NSC_{CA}$. Handle $HSC_{CA}$ operates in exactly the same
fashion as Handle HSC, with the difference that the definitions of
$COI_{CA}$ and $SCOI_{CA}$ are now used. The same holds for Handle
$MSC_{CA}$/$NSC_{CA}$.

3) The six bound based cases mentioned above are also applicable, with
Handle $HSC_{CA}$ and Handle $MSC_{CA}$/$NSC_{CA}$ used in place of
their unsupervised counterparts.

4) Another modification is needed for case 5. When the upper bound
$U(S)$ is non-significant at $\alpha_{Low}$ and $U(S)$ has reached the
maximum value $H(C)$, the CACI of $S$ or its children can never be
significant at $\alpha_{Low}$. Therefore $S$ and all its children will
be $NSC_{CA}$, so we can safely prune the subtree rooted at $S$.

9.4 The Algorithms 

As before, we first describe the algorithms for unsupervised analysis
(the CIM and IIM algorithms). Then we describe the changes to be made
in CIM to obtain the $CIM_{CA}$ algorithm for supervised analysis.

9.4.1 The CIM Algorithm

We describe the algorithm for unsupervised analysis; the modifications
to CIM for supervised analysis are described separately. We assume
that CIM uses the data obtained after redundancy removal (Section 9.1)
for all computations of correlation information and of the upper and
lower bounds. The inputs are the significance levels $\alpha_H$ for
$\alpha_{High}$ and $\alpha_L$ for $\alpha_{Low}$. Lines 2-8 compute
the TCI for every pair of attributes and store the pair in NextLevel
only if the node is a HSC or a NSC. The HSC are collected in $\theta$
to be output. Lines 9-33 explore the combinatorial search space in a
breadth-first fashion, wherein each node is evaluated to be a
HSC/MSC/NSC and either the subtree rooted at the node is pruned or the
search process is continued, depending upon the TCI bound based
conditions 1-6 outlined above. The sample size based pruning takes
place in line 14.

Algorithm CIM($\alpha_H$, $\alpha_L$)
Input: $\alpha_H$, $\alpha_L$
Output: $\theta$ (set of COI and SCOI)
1. NextLevel $\leftarrow \emptyset$; $\theta \leftarrow \emptyset$;
2. for attribute pair $S = \{A_i; A_j\}$ do
3.   if ($P(TCI(S)) < \alpha_H$)
4.     Add $S$ to NextLevel, $\theta$;
5.   elseif ($P(TCI(S)) > \alpha_L$)
6.     Add $S$ to NextLevel;
7.   endif
8. endfor
9. while NextLevel $\ne$ empty do
10.   CurrLevel $\leftarrow$ NextLevel;
11.   NextLevel $\leftarrow \emptyset$;
12.   for each $P \in$ CurrLevel do
13.     for each child $S$ of $P$ do
14.       if not enough samples, goto line 31;
15.       Calculate $U(S)$, $L(S)$, $P(U(S))$, $P(L(S))$;
16.       if ($P(L(S)) < \alpha_H$) do
17.         Handle HSC to update NextLevel, $\theta$;
18.       elseif (($P(U(S)) < \alpha_H < P(L(S)) < \alpha_L$) or
19.               ($P(U(S)) < \alpha_H$, $\alpha_L < P(L(S))$) or
20.               ($\alpha_H < P(U(S)) < \alpha_L < P(L(S))$))
21.         $T \leftarrow TCI(S)$;
22.         if ($Pvalue(T) < \alpha_H$)
23.           Handle HSC to update NextLevel, $\theta$;
24.         else
25.           Handle MSC/NSC to update NextLevel;
26.         endif
27.       elseif (($\alpha_L < P(U(S)) \le P(L(S))$) or
28.               ($\alpha_H < P(U(S)) \le P(L(S)) < \alpha_L$))
29.         Handle MSC/NSC to update NextLevel;
30.       endif
31.     endfor //for each child
32.   endfor //for each P
33. endwhile
34. return $\theta$;

Next we describe the IIM algorithm that is used to compute the KWII
from the attribute sets output by CIM.

9.4.2 IIM Algorithm

Let $\nu$, a subset of the full attribute set, be the set of
attributes present in $\theta$ (the combinations output by CIM). Let
$K$ be the maximum order of combinations to be explored. Assuming the
sample size to be $N$ and the cardinality of the set of values of the
$K$-th order combination to be $V$, $K$ is chosen such that
$N/V \ge 5$. The following algorithm computes the KWII of attribute
sets of order $\le K$.

Algorithm IIM($\nu$, $K$)
Input: $\nu$ (set of attributes present in $\theta$), $K$ (order of
the largest attribute set $\in \theta$)
Output: $\Delta$ (set of combinations and their KWII)
1. $\Delta \leftarrow$ entropies of all subsets $\tau$ of $\nu$ s.t. $|\tau| \le K$;
2. for $A_i \in \nu$ do
3.   for each subset $X$ of $\nu \setminus \{A_i\}$, s.t. $|X| < K$ do
4.     $\Delta(X \cup \{A_i\}) \leftarrow \Delta(X \cup \{A_i\}) - \Delta(X)$
5.   endfor
6. endfor
7. return $\Delta$;

In IIM, the array $\Delta$ is indexed by attribute combinations. We
initialize $\Delta$ with the entropies of all subsets of $\nu$
containing up to $K$ attributes (line 1). For example, with 3
attributes $A_1, A_2, A_3$ and $K = 2$,
$\Delta(\{A_1\}) = H(A_1)$, $\Delta(\{A_2\}) = H(A_2)$,
$\Delta(\{A_3\}) = H(A_3)$, $\Delta(\{A_1;A_2\}) = H(A_1;A_2)$,
$\Delta(\{A_1;A_3\}) = H(A_1;A_3)$ and
$\Delta(\{A_2;A_3\}) = H(A_2;A_3)$. In the end, $\Delta$ shall contain
the negative of the KWII value for each attribute combination.
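
The update in line 4 is an in-place subset-difference transform, which the sketch below spells out on a dictionary keyed by attribute combinations. The function names, the use of `frozenset` keys and the XOR toy data are assumptions of this example; the final sign flip turns the negative values held in the array into KWII values.

```python
from collections import Counter
from itertools import combinations
from math import log2


def entropy(rows, cols):
    """Empirical joint entropy (in bits) of the columns `cols` of `rows`."""
    counts = Counter(tuple(r[c] for c in cols) for r in rows)
    n = len(rows)
    return -sum((k / n) * log2(k / n) for k in counts.values())


def iim(rows, attrs, max_order):
    """Return {frozenset(combination): KWII} for all orders <= max_order."""
    attrs = sorted(attrs)
    # Line 1: initialise with the entropies of all subsets up to max_order.
    delta = {frozenset(s): entropy(rows, s)
             for k in range(1, max_order + 1)
             for s in combinations(attrs, k)}
    # Lines 2-6: for each attribute, subtract the entry of every subset X
    # that excludes it from the entry of X plus that attribute.
    for a in attrs:
        others = [b for b in attrs if b != a]
        for k in range(1, max_order):
            for x in combinations(others, k):
                x = frozenset(x)
                delta[x | {a}] -= delta[x]
    # The array now holds the negative of the KWII, so flip the sign.
    return {s: -v for s, v in delta.items()}


# Hypothetical toy data: column 2 is the XOR of columns 0 and 1.
rows = [(a, b, a ^ b) for a in (0, 1) for b in (0, 1)] * 25
result = iim(rows, [0, 1, 2], 3)
```

On this toy data the three-way combination carries one bit of interaction information while every pair carries none, which is the pattern expected for a pure XOR association.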

9.4.3 The $CIM_{CA}$ Algorithm

Very few modifications to CIM are required to use it for class
associated analysis (the $CIM_{CA}$ algorithm): (1) use $CACI(S)$
instead of $TCI(S)$ computations, (2) substitute the Handle HSC and
Handle MSC/NSC procedures with the Handle $HSC_{CA}$ and Handle
$MSC_{CA}$/$NSC_{CA}$ procedures respectively, (3) use $COI_{CA}$ and
$SCOI_{CA}$, (4) use $S = \{A_i; A_j; C\}$ in line 2, and (5) in lines
27-30 do not call procedure Handle $MSC_{CA}$/$NSC_{CA}$ if
$U(S) = H(C)$ and $\alpha_L < P(U(S)) \le P(L(S))$ (using condition 4
from Section 9.3.2).

9.4.4 Modifications to IIM 

For supervised analysis, the combinations output by the $CIM_{CA}$
algorithm constitute the input for IIM. No changes to the IIM
algorithm described above are needed. Let $\nu$, a subset of the full
attribute set, be the set of attributes (including $C$) present in
$\theta$ (the combinations output by $CIM_{CA}$). Let $K$ be the
maximum count of predictor attributes in the combinations to be
explored. With these inputs, the IIM algorithm can be used unchanged
for supervised analysis. Once the set of combinations and their KWII
is output by IIM, remove those combinations that do not contain $C$.
The remaining entries contain the negative of the KWII values for
combinations containing predictor attributes and the class $C$.

10 Experimental Results 

In this section we present the experimental results that highlight the
performance of our algorithms. In all our experiments, unless
otherwise stated, we have set the parameters $\alpha_{High}$,
$\alpha_{Low}$ and $\lambda$ to $10^{-8}$, $10^{-2}$ and 0.75
respectively. One can set the $\alpha$'s depending on the experiment
and data size, e.g. one would set them conservatively to adjust for
multiple comparisons. $\lambda$ can be set to a value > 0.7 depending
on how much redundancy one wants to remove from the data. Also, for
all experiments, 10,000 permutations are used for evaluating the
significance of each KWII at a significance level of 0.0001. We use
NIFS and mRMR for comparison purposes. NIFS was run with parameter
values $\alpha = 0.2$ and $\beta = 0.8$ as used in the paper.



10.1 Unsupervised Analysis 

10.1.1 Experiment 1 

Here, we evaluate the effectiveness of our mining methods in detecting
attribute associations using a synthetic data set in the absence of a
class variable. The data consists of 15 binary attributes and 200
samples, and three associations are planted in the data. The
associations embedded in the data are (1) $A_1 = A_2 \oplus A_3$,
(2) $A_6 = A_7 \oplus A_8 \oplus A_9$, and
(3) $A_{11} = A_{12} \oplus A_{13} \oplus A_{14}$, where $\oplus$
denotes the exclusive-or operation. In addition, noise is added by
flipping each of $A_1$, $A_6$ and $A_{11}$ with error probability $p$.
We repeat the experiment 100 times. Figures 2A and 2B show the TCI and
significant KWII mined by CIM and IIM, respectively, for $p = 0.1$.
The significance of each KWII was determined using a pvalue of 0.001.
The results are presented graphically as spectra of TCI/KWII values
plotted against attribute combinations. Utilizing statistical
significance based mining, CIM successfully identifies the embedded
associations exactly. The KWII spectra contain only the three
strongest associations (pvalue < 0.001). Figure 2C shows the % of
combinations with significant correlation information detected by
CIM/IIM, compared with the two methods NIFS and mRMR. The error
probability $p$ is varied as 0, 0.1, 0.15, 0.2, 0.25 and 0.3. Using
hard thresholds, NIFS fails to detect the attribute associations when
the strength of an association varies due to noise in the data.
CIM/IIM solves this problem by mining with statistical significance
levels instead of threshold values. The other method, mRMR, finds a
subset of attributes with minimal redundancy among the attributes and
a class label attribute. As mRMR requires a class attribute, we have
run mRMR separately with (1) $A_1$, (2) $A_6$, and (3) $A_{11}$ as the
class attribute. However, mRMR performs poorly (even at $p = 0.0$)
because mRMR uses only the mutual information between each attribute
and the class to identify the associations.

Fig. 2: (A) TCI spectra (B) KWII spectra (C) Comparison of CIM/IIM
with NIFS and mRMR.

10.1.2 Experiment 2

This experiment is derived from a genetic interaction experiment and
mimics the case of pure epistasis [38] between two SNPs affecting a
disease trait. A SNP is a DNA sequence variation in a base pair
position at which different sequence alternatives (alleles) exist
among individuals in some population. The set of SNPs on a single
chromosome of a pair of homologous chromosomes is referred to as a
haplotype, and two haplotypes taken together constitute a genotype.
Each SNP usually has two alleles (e.g. A, a), resulting in three
genotype values (AA, Aa, aa). In a case-control experiment, the
disease trait is usually binary (0=healthy, 1=diseased). The simulated
data in this experiment consists of 16 discrete attributes:
$A_1$ - $A_{15}$ represent SNPs, each having 3 genotypic states, and
$A_{16}$, representing the disease trait, is binary. The data consists
of 100 samples each of $A_{16}=0$ and $A_{16}=1$.

Associations are created in the data between $A_1$, $A_2$ and $A_{16}$
using the model in Figure 3. This results in the following significant
combinations: ($C_1$) $\{A_1; A_2\}$ and ($C_2$)
$\{A_1; A_2; A_{16}\}$. Note that $C_1$ is a COI and $C_2$ is a SCOI.
Both CIM and IIM are able to identify both the associations (spectra
shown in Figure 3A and B). NIFS identifies only $C_1$ because it
assumes that any superset of a set with strong correlation information
contains redundant information. However, in this case, $A_{16}$ can
only be identified in combination with $C_1$, so that $C_2$ contains
information about $A_{16}$ not present in $C_1$. Finally, we run mRMR
with $A_{16}$ as the class attribute. Combinations $\{A_1; A_{16}\}$
and $\{A_2; A_{16}\}$ have extremely weak mutual information of 0.008
and 0.003 respectively. As mRMR depends on the mutual information
between each attribute and the class, it fails to identify any
combination involving $A_1$ and $A_2$.

          A_1=1   A_1=2   A_1=3
A_2=1     0.0     0.0     0.4
A_2=2     0.0     0.2     0.0
A_2=3     0.4     0.0     0.0

The variables $A_1$, $A_2$ are interacting with $A_{16}$. For $A_1$,
the frequencies of the states, i.e. $P(A_1=1)$, $P(A_1=2)$ and
$P(A_1=3)$, are 0.25, 0.5 and 0.25 respectively. Similarly for $A_2$.
Each cell gives the probability $P(A_{16}=1|A_1, A_2)$, e.g.
$P(A_{16}=1|A_1=1, A_2=3) = 0.4$.

[Panel A: TCI spectra]

Fig. 3: Association model & spectra in Experiment 2 for Unsupervised
Analysis.

10.2 Supervised Analysis

In this section, we describe the experimental results using our
algorithms for supervised analysis (i.e. the class attribute is
present).

10.2.1 Experiment 1

In this experiment, the simulated data consists of 15 discrete
attributes $A_1$ - $A_{15}$, each having 3 states, and a class
attribute $C$. Each of the attributes can be thought of as
representing genotypes, and $C$ represents a binary disease trait. The
data consists of 300 samples each of $C=0$ and $C=1$. Associations are
created in the data between $A_1$, $A_2$ and $C$ using the model in
Figure 4. This results in the following significant combinations:
($C_1$) $\{A_1; C\}$ and ($C_2$) $\{A_1; A_2; C\}$. This experiment is
simulated such that $C_1$ is a $COI_{CA}$ and $C_2$ is a $SCOI_{CA}$.
Both $CIM_{CA}$ and $IIM_{CA}$ are able to identify both the
associations (Figure 4) with 100% detection ability and < 5% false
combinations in 100 repetitions of the experiment. However, NIFS fails
to identify any interaction as it uses hard thresholds. Note that in
this case, $A_2$ can only be identified in combination with $C_1$, so
that $C_2$ contains information about $A_2$ not present in $C_1$. When
we run mRMR with $C$ as the class attribute, it detects the
combination $\{A_1; C\}$ with 100% detection ability. However, because
the mutual information of $\{A_2; C\}$ is very weak
($\approx 0.003$), mRMR detects $\{A_2; C\}$ with only 10% detection
ability.

The variables $A_1$, $A_2$ are interacting with $C$. For $A_1$, the
frequencies of the states, i.e. $P(A_1=1)$, $P(A_1=2)$ and
$P(A_1=3)$, are 0.25, 0.5 and 0.25 respectively. For $A_2$, the values
of $P(A_2=1)$, $P(A_2=2)$ and $P(A_2=3)$ are 0.01, 0.18 and 0.81
respectively. Each cell gives the probability $P(C=1|A_1, A_2)$, e.g.
$P(C=1|A_1=1, A_2=3) = 0.31$.

[Panel: KWII / PAI spectra]

Fig. 4: Association model & spectra in Experiment 1 for Supervised
Analysis.


10.2.2 Experiment 2 

The purpose of this experiment is to demonstrate the effectiveness of
the redundancy based pruning strategy. In this experiment, the data
consists of 15 discrete attributes $A_1$ - $A_{15}$, each with 3
states, and a class attribute $C$. A set of complex attribute
associations is created involving $A_1$, $A_2$, $A_3$ and $C$. Each of
$A_1$, $A_2$, $A_3$ represents a SNP having genotypic states AA/Aa/aa,
BB/Bb/bb and CC/Cc/cc respectively. $C$ stands for a binary disease
trait. In addition, redundancy is added by replicating $A_1$ to $A_6$,
$A_2$ to $A_7$ and $A_3$ to $A_8$ with 5% error. The data consists of
800 samples each of $C=0$ and $C=1$. The rule that causes $C$ to be 1
and the CACI and KWII spectra obtained by $CIM_{CA}$ and $IIM_{CA}$
are shown in Figure 5. Note that we have effectively removed the
redundant attributes ($A_6$, $A_7$, $A_8$) and identified all the
interacting attributes. Also observe that the KWII spectra complement
the CACI spectra by discovering associations like $\{A_1; A_3; C\}$
and $\{A_1; A_2; A_3; C\}$ that are not present in the CACI spectra.
Confounded by redundancy, NIFS
generates 150 combinations containing attributes Ai — A±q 
and C, but does not contain any combination from the 
CACI spectra identified by CIMca as their magnitudes are 
less than 0.8. mRMR is run with C as the class attribute and 
it identifies attributes A\,Az,C in associations {Ai;C} 
and {A3;C} but not A 2 because the mutual information 
{A 2 ;C} is only 0.0008. These show that CIM CA and 
IIMca effectively remove redundancy and are capable of 
identifying a diverse range of class associated attribute 
associations. 
if (A3 = CC or cc) 
    if (A1 = AA or aa) and (A2 = BB or bb) 
        A16 = 1 
    if (A1 = Aa) and (A2 = BB or Bb) 
        A16 = 1 
else 
    if (A1 = AA or aa) and (A2 = Bb) 
        A16 = 1 
    if (A1 = Aa) and (A2 = BB) 
        A16 = 1 
endif 

[Plots: CACI and KWII spectra; x-axis: Combinations] 

Fig. 5: Association model & spectra in Experiment 2 for 
Supervised Analysis. 
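
The data generation described for Experiment 2 can be sketched as follows. This is a minimal illustration under stated assumptions, not the generator used for the reported results: genotypes are coded 0/1/2 (homozygous, heterozygous, homozygous) and drawn uniformly, classes are balanced by rejection sampling, and "5% error" is taken to mean that each replicated value is replaced by a different state with probability 0.05. The names `disease_rule`, `noisy_copy` and `simulate` are hypothetical, and the rule follows our reading of the model in Figure 5.

```python
import random


def _homozygous(g):
    # Assumed coding: 0 = AA/BB/CC, 1 = Aa/Bb/Cc, 2 = aa/bb/cc.
    return g in (0, 2)


def disease_rule(a1, a2, a3):
    """Return 1 when the rule of Figure 5 (as read here) sets the class to 1."""
    if _homozygous(a3):                      # A3 = CC or cc
        if _homozygous(a1) and _homozygous(a2):
            return 1
        if a1 == 1 and a2 in (0, 1):         # A1 = Aa and A2 = BB or Bb
            return 1
    else:                                    # A3 = Cc
        if _homozygous(a1) and a2 == 1:
            return 1
        if a1 == 1 and a2 == 0:
            return 1
    return 0


def noisy_copy(values, error_rate, rng, states=(0, 1, 2)):
    """Copy `values`, replacing each by a different state with prob. error_rate."""
    out = []
    for v in values:
        if rng.random() < error_rate:
            out.append(rng.choice([s for s in states if s != v]))
        else:
            out.append(v)
    return out


def simulate(n_per_class, n_attrs=15, error_rate=0.05, seed=0):
    """Rows of n_attrs genotypes plus a class label, n_per_class rows per class."""
    rng = random.Random(seed)
    rows, counts = [], {0: 0, 1: 0}
    while min(counts.values()) < n_per_class:
        g = [rng.randrange(3) for _ in range(n_attrs)]
        c = disease_rule(g[0], g[1], g[2])
        if counts[c] < n_per_class:
            rows.append(g + [c])
            counts[c] += 1
    # A6, A7, A8 (indices 5, 6, 7) become noisy replicas of A1, A2, A3.
    for src, dst in ((0, 5), (1, 6), (2, 7)):
        replica = noisy_copy([r[src] for r in rows], error_rate, rng)
        for r, v in zip(rows, replica):
            r[dst] = v
    return rows
```

Calling `simulate(800)` yields 1600 rows with 800 samples per class, matching the sample sizes stated for this experiment.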



10.3 Runtime Evaluation 

We have used the following two data sets to evaluate 
the efficiency of our pruning methods: (1) The Crohn's disease 
data set [39] is derived from the 616 kilobase region of 
human Chromosome 5q31 that may contain a genetic 
variant responsible for Crohn's disease; 103 SNPs were 
genotyped, and the data contains 144 case and 243 control individuals. 
(2) The tick-borne encephalitis data set [40] consists of 58 SNPs 
genotyped from DNA of 26 patients with severe tick-borne 
encephalitis virus-induced disease and 65 patients with mild 
disease. Figure 6 shows the runtime of our mining method 
(CIM followed by IIM) under the redundancy based and 
TCI based pruning strategies, as well as when both are 
applied together and when neither is applied. For both data sets, 
the missing values were imputed with the most frequent 
value for that particular SNP. Sample size based pruning 
is assumed to be active in all cases. The number of 
attributes is varied as follows: for each data set, from 
the set of N attributes, a set of K attributes (K = 10, 
20, 30, 40) is randomly selected and removed from the 
original data. The experiment is repeated 10 times for each 
data set and the average runtime for each set of N - K 
attributes is shown. We observe that the runtime is lowest 
when both pruning strategies are active (green, circles). 
TCI based pruning (blue, squares) achieves better efficiency 
than redundancy based pruning (red, rhombuses) in both 
data sets, and the runtime increases exponentially when no 
pruning is applied. This demonstrates the effectiveness 
of our pruning methods, as the potential search space is 
exponential in the number of attributes. The results of a 
similar experiment with supervised analysis are shown in 
Figure 7. 
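
The evaluation protocol above (most-frequent-value imputation, random removal of K attributes, and averaging over repeated runs) can be sketched as follows. This is an illustrative harness only: `mine` is a hypothetical callable standing in for the mining method under a given pruning configuration, missing genotypes are assumed to be encoded as `None`, and the function names are our own.

```python
import random
import time
from collections import Counter


def impute_most_frequent(rows, missing=None):
    """Return a copy of `rows` in which each missing entry is replaced by
    the most frequent observed value of its column (SNP)."""
    filled = [list(r) for r in rows]
    for j in range(len(rows[0])):
        observed = [r[j] for r in rows if r[j] != missing]
        if not observed:
            continue  # nothing to impute from; leave the column unchanged
        mode = Counter(observed).most_common(1)[0][0]
        for r in filled:
            if r[j] == missing:
                r[j] = mode
    return filled


def average_runtime(rows, k, mine, repeats=10, seed=0):
    """Average wall-clock time of `mine` on data with k randomly chosen
    attributes removed, over `repeats` independent random removals."""
    rng = random.Random(seed)
    n_cols = len(rows[0])
    total = 0.0
    for _ in range(repeats):
        removed = set(rng.sample(range(n_cols), k))
        kept = [j for j in range(n_cols) if j not in removed]
        reduced = [[r[j] for j in kept] for r in rows]
        start = time.perf_counter()
        mine(reduced)
        total += time.perf_counter() - start
    return total / repeats
```

For the supervised setting, the class column would presumably be excluded from the set of removable attributes; the sketch above treats all columns alike, as in the unsupervised case.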

10.4 Analysis of Crohn's Disease 

We assess the potential of our mining methods (CIM 
followed by IIM and CIM_CA followed by IIM_CA) for 
identifying key SNPs involved in the causation of Crohn's 
disease using the data set from Daly et al. [39]. The Crohn's 
disease data set [39] is derived from the 616 kilobase region 
of human Chromosome 5q31 that may contain a genetic 
variant responsible for Crohn's disease; 103 SNPs were 
genotyped, and the data contains 144 case and 243 control individuals. 
The 103 SNPs in the data are numbered 0 to 102. Rioux 
Fig. 6: Runtime Evaluation for Unsupervised Analysis with 
(A) Tick (B) Crohn's Disease. 



[Plots: runtime versus No. of Attributes; legend: Both, TCI Bound Based, Redundancy Based, Neither] 



Fig. 7: Runtime Evaluation for Supervised Analysis with 
(A) Tick (B) Crohn's Disease. 



et al. found 11 SNPs (IGR2055a_1, IGR2060a_1, 
IGR2063b_1, IGR2078a_1, IGR2096a_1, IGR2198a_1, 
IGR2230a_1, IGR2277a_1, IGR2081a_1, IGR3096a_1 
and IGR3236a_1) with alleles that were associated with 
risk of Crohn's disease. Nine of the 11 significant SNPs are 
present in the data set we analyzed; SNPs IGR2078a_1 
and IGR2277a_1 are missing. For association information 
mining, subjects and SNPs with missing genotypes are 
eliminated, resulting in 40 SNPs with 58 cases and 92 
controls. We perform the following two analyses with the 
data: (1) mine the association information in the data 
without the disease phenotype, i.e., unsupervised analysis; 
(2) mine the association information using our supervised 
approach with the disease phenotype as the class 
attribute. In our first analysis, we identify three SNPs, 
IGR2055a_1, IGR2230a_1 and IGR3236a_1, among the 
combinations with significant KWII. In the second analysis, 
where we take the case/control status into account, 
the five SNPs IGR2198a_1, IGR2055a_1, IGR3236a_1, 
IGR2081a_1 and IGR2230a_1 are found among the 
{SNP, Phenotype} and {SNP, SNP, Phenotype} combinations 
with significant KWII. On closer examination 
of the data, we found that due to high linkage 
disequilibrium in the genomic region examined, SNPs 
IGR2066a_1, IGR2063b_1 and IGR2096a_1 belonged 
to Cover(IGR2055a_1) while IGR3096a_1 belonged to 
Cover(IGR2230a_1) and were pruned during the redundancy 
based pruning phase of our mining method. However, 
each of these SNPs is covered by a representative SNP 
included in the data; as a result, these SNPs and their associated 
interactions can be easily recovered using the Cover 
data structure after IIM completes. For example, consider 
SNPs IGR2055a_1 and IGR2066a_1. If IGR2055a_1 
forms a combination {IGR2055a_1; IGR2198a_1; C} 
with significant KWII, then, since IGR2066a_1 ∈ 
Cover(IGR2055a_1), for SNP IGR2066a_1 we can 
obtain the candidate combinations with high interaction information, 
{IGR2066a_1; C} and {IGR2066a_1; IGR2198a_1; C}, 
and test their significance. 
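
The recovery step described above can be sketched in Python as follows. The sketch assumes, purely for illustration, that the Cover data structure is available as a dictionary mapping each representative SNP to the set of SNPs it covers; `expand_with_cover` is a hypothetical helper name. It only enumerates candidate combinations; each candidate would still have to be tested for significance.

```python
from itertools import product


def expand_with_cover(combination, cover):
    """Enumerate candidate combinations obtained by substituting any
    attribute in `combination` with a member of its cover set.

    `cover` maps a representative attribute to the set of attributes it
    covers (assumed layout). The original combination is not returned."""
    choices = [[attr] + sorted(cover.get(attr, ())) for attr in combination]
    original = tuple(combination)
    expanded = []
    for picked in product(*choices):
        # Skip the unchanged combination and any with a repeated attribute.
        if picked != original and len(set(picked)) == len(picked):
            expanded.append(picked)
    return expanded


# Example using the cover relation mentioned in the text.
cover = {"IGR2055a_1": {"IGR2066a_1"}}
candidates = expand_with_cover(("IGR2055a_1", "IGR2198a_1", "C"), cover)
# candidates now holds the single tuple ("IGR2066a_1", "IGR2198a_1", "C")
```

Because pruned SNPs are substituted only after mining completes, the search itself never has to enumerate combinations over the redundant attributes.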

11 Discussion 

In this paper, we have analyzed the problem of mining 
significant association information between attributes in a 
data set for both supervised and unsupervised data analysis 
and have presented novel methods to mine two types of 
association information: correlation information and interaction 
information. Specifically, we have derived the distributional 
properties of correlation information and bounds 
on correlation information for the supervised case. We have 
also developed a novel method for fast permutations to 
evaluate the significance of interaction information. Using 
several complex experimental data sets and a real data set, we have 
critically evaluated the effectiveness and efficiency of our 
mining strategy. For future work, we would like to explore 
strategies for making our method scalable to the large 
numbers of attributes commonly observed in genetic data 
sets. 